1.2 Network Hardware


Broadly speaking, there are two types of transmission technology that are in widespread use. They are as 
follows: 


1. Broadcast links. 
2. Point-to-point links. 


Broadcast networks have a single communication channel that is shared by all the machines on the network. 
Short messages, called packets in certain contexts, sent by any machine are received by all the others. An 
address field within the packet specifies the intended recipient. Upon receiving a packet, a machine checks the 
address field. If the packet is intended for the receiving machine, that machine processes the packet; if the 
packet is intended for some other machine, it is just ignored. 


Broadcast systems generally also allow the possibility of addressing a packet to all destinations by using a 
special code in the address field. When a packet with this code is transmitted, it is received and processed by 
every machine on the network. This mode of operation is called broadcasting. 

Some broadcast systems also support transmission to a subset of the machines, known as multicasting. 

One possible scheme is to reserve one bit to indicate multicasting. The remaining n - 1 address bits can hold a 
group number. Each machine can "subscribe" to any or all of the groups. When a packet is sent to a certain 
group, it is delivered to all machines subscribing to that group. 
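
The address check that each machine performs can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the 8-bit address width, the all-ones broadcast code, and the choice of the top bit as the multicast flag are made up for the example and are not taken from any particular network.

```python
ADDRESS_BITS = 8                            # assumed address width (n)
BROADCAST = (1 << ADDRESS_BITS) - 1         # assumed "special code": all ones
MULTICAST_FLAG = 1 << (ADDRESS_BITS - 1)    # one reserved bit marks multicast


def accepts(dest, my_address, my_groups):
    """Return True if a machine should process a packet addressed to dest."""
    if dest == BROADCAST:
        return True                          # broadcasting: every machine processes it
    if dest & MULTICAST_FLAG:
        group = dest & (MULTICAST_FLAG - 1)  # remaining n - 1 bits hold the group number
        return group in my_groups            # multicasting: subscribers only
    return dest == my_address                # otherwise: exact match, else ignore
```

For instance, `accepts(MULTICAST_FLAG | 3, 0x06, {3})` is true for a machine subscribed to group 3, while a machine subscribed only to group 4 ignores the same packet. Because the broadcast code is tested first, the all-ones group number is not usable as a group in this sketch.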


In contrast, point-to-point networks consist of many connections between individual pairs of machines. To go 
from the source to the destination, a packet on this type of network may have to first visit one or more 
intermediate machines. Often multiple routes, of different lengths, are possible, so finding good ones is important 
in point-to-point networks. As a general rule (although there are many exceptions), smaller, geographically 
localized networks tend to use broadcasting, whereas larger networks usually are point-to-point. Point-to-point 
transmission with one sender and one receiver is sometimes called unicasting. 


An alternative criterion for classifying networks is their scale. In Fig. 1-6 we classify multiple processor systems 
by their physical size. At the top are the personal area networks, networks that are meant for one person. For 
example, a wireless network connecting a computer with its mouse, keyboard, and printer is a personal area 
network. Also, a PDA that controls the user's hearing aid or pacemaker fits in this category. Beyond the personal 
area networks come longer-range networks. These can be divided into local, metropolitan, and wide area 
networks. Finally, the connection of two or more networks is called an internetwork. The worldwide Internet 
is a well-known example of an internetwork. 


Figure 1-6. Classification of interconnected processors by scale. 
1.2.1 Local Area Networks


Local area networks, generally called LANs, are privately-owned networks within a single building or campus of 
up to a few kilometers in size. They are widely used to connect personal computers and workstations in 
company offices and factories to share resources (e.g., printers) and exchange information. LANs are 
distinguished from other kinds of networks by three characteristics: (1) their size, (2) their transmission 
technology, and (3) their topology. 


LANs are restricted in size, which means that the worst-case transmission time is bounded and known in 
advance. Knowing this bound makes it possible to use certain kinds of designs that would not otherwise be 
possible. It also simplifies network management. 


LANs may use a transmission technology consisting of a cable to which all the machines are attached, like the 
telephone company party lines once used in rural areas. Traditional LANs run at speeds of 10 Mbps to 100 
Mbps, have low delay (microseconds or nanoseconds), and make very few errors. Newer LANs operate at up to 
10 Gbps. In this book, we will adhere to tradition and measure line speeds in megabits/sec (1 Mbps is 1,000,000 
bits/sec) and gigabits/sec (1 Gbps is 1,000,000,000 bits/sec). 
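
As a quick check on these units, the time needed to put a packet on the line is simply its length in bits divided by the line speed. The sketch below assumes a 12,000-bit packet purely for illustration; the function name is hypothetical.

```python
MBPS = 1_000_000          # 1 Mbps is 1,000,000 bits/sec
GBPS = 1_000_000_000      # 1 Gbps is 1,000,000,000 bits/sec


def transmission_time(num_bits, rate_bps):
    """Seconds needed to transmit num_bits at rate_bps (propagation delay ignored)."""
    return num_bits / rate_bps


# An assumed 12,000-bit packet on a 10-Mbps LAN and on a 10-Gbps LAN.
slow = transmission_time(12_000, 10 * MBPS)   # about 1.2 msec
fast = transmission_time(12_000, 10 * GBPS)   # about 1.2 microsec
```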


Various topologies are possible for broadcast LANs. Figure 1-7 shows two of them. In a bus (i.e., a linear cable) 
network, at any instant at most one machine is the master and is allowed to transmit. All other machines are 
required to refrain from sending. An arbitration mechanism is needed to resolve conflicts when two or more 
machines want to transmit simultaneously. The arbitration mechanism may be centralized or distributed. IEEE 
802.3, popularly called Ethernet, for example, is a bus-based broadcast network with decentralized control, 
usually operating at 10 Mbps to 10 Gbps. Computers on an Ethernet can transmit whenever they want to; if two 
or more packets collide, each computer just waits a random time and tries again later. 
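
The "wait a random time and try again" idea can be illustrated with a toy simulation. This is a deliberately simplified sketch, not the actual IEEE 802.3 rule: time is divided into slots, a frame occupies exactly one slot, and the `max_wait` parameter and the function name are assumptions made for the example.

```python
import random


def simulate(num_stations, max_wait=8, seed=0):
    """Each station has one frame to send; return stations in order of success."""
    rng = random.Random(seed)                        # seeded so runs are repeatable
    next_try = {s: 0 for s in range(num_stations)}   # slot of each station's next attempt
    delivered = []
    slot = 0
    while next_try:
        ready = [s for s, t in next_try.items() if t == slot]
        if len(ready) == 1:
            # Exactly one sender in this slot: the frame gets through.
            delivered.append(ready[0])
            del next_try[ready[0]]
        elif len(ready) > 1:
            # Collision: every sender involved picks its own random later slot.
            for s in ready:
                next_try[s] = slot + 1 + rng.randrange(max_wait)
        slot += 1
    return delivered
```

All stations start in slot 0, so with two or more stations the first slot always ends in a collision; the random waits then spread the retries out until every frame has been delivered.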


Figure 1-7. Two broadcast networks. (a) Bus. (b) Ring. 
A second type of broadcast system is the ring. In a ring, each bit propagates around on its own, not waiting for
the rest of the packet to which it belongs. Typically, each bit circumnavigates the entire ring in the time it takes to
transmit a few bits, often before the complete packet has even been transmitted. As with all other broadcast
systems, some rule is needed for arbitrating simultaneous accesses to the ring. Various methods, such as
having the machines take turns, are in use. IEEE 802.5 (the IBM token ring) is a ring-based LAN operating at 4
and 16 Mbps. FDDI is another example of a ring network.


Broadcast networks can be further divided into static and dynamic, depending on how the channel is allocated. A 
typical static allocation would be to divide time into discrete intervals and use a round-robin algorithm, allowing 
each machine to broadcast only when its time slot comes up. Static allocation wastes channel capacity when a 
machine has nothing to say during its allocated slot, so most systems attempt to allocate the channel 
dynamically (i.e., on demand). 
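
The waste caused by static allocation is easy to quantify with a small sketch. The model below is an assumption made for illustration: `ready_by_slot[t]` is the set of machines that have something to say in slot t, and the "dynamic" case is idealized (a slot is used whenever anyone is ready, with no contention overhead).

```python
def static_utilization(ready_by_slot, num_machines):
    """Fraction of slots used when slot t belongs to machine t mod num_machines."""
    used = 0
    for slot, ready in enumerate(ready_by_slot):
        owner = slot % num_machines      # round-robin: each slot has exactly one owner
        if owner in ready:
            used += 1                    # the owner had data; otherwise the slot is wasted
    return used / len(ready_by_slot)


def dynamic_utilization(ready_by_slot):
    """Fraction of slots used under idealized on-demand allocation."""
    used = sum(1 for ready in ready_by_slot if ready)
    return used / len(ready_by_slot)
```

With four machines of which only machine 0 ever has data, `static_utilization([{0}] * 8, 4)` gives 0.25, whereas `dynamic_utilization([{0}] * 8)` gives 1.0.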


Dynamic allocation methods for a common channel are either centralized or decentralized. In the centralized 
channel allocation method, there is a single entity, for example, a bus arbitration unit, which determines who 
goes next. It might do this by accepting requests and making a decision according to some internal algorithm. In 
the decentralized channel allocation method, there is no central entity; each machine must decide for itself
whether to transmit. You might think that this always leads to chaos, but it does not. 


1.2.2 Metropolitan Area Networks 


A metropolitan area network, or MAN, covers a city. The best-known example of a MAN is the cable television 
network available in many cities. This system grew from earlier community antenna systems used in areas with 
poor over-the-air television reception. In these early systems, a large antenna was placed on top of a nearby hill 
and the signal was then piped to the subscribers' houses.


At first, these were locally-designed, ad hoc systems. Then companies began jumping into the business, getting 
contracts from city governments to wire up an entire city. The next step was television programming and even 
entire channels designed for cable only. Often these channels were highly specialized, such as all news, all 
sports, all cooking, all gardening, and so on. But from their inception until the late 1990s, they were intended for 
television reception only. 


Starting when the Internet attracted a mass audience, the cable TV network operators began to realize that with 
some changes to the system, they could provide two-way Internet service in unused parts of the spectrum. At 
that point, the cable TV system began to morph from a way to distribute television to a metropolitan area 
network. To a first approximation, a MAN might look something like the system shown in Fig. 1-8. In this figure 
we see both television signals and Internet being fed into the centralized head end for subsequent distribution to 
people's homes. We will come back to this subject in detail in Chap. 2. 


Figure 1-8. A metropolitan area network based on cable TV. 
Cable television is not the only MAN. Recent developments in high-speed wireless Internet access resulted in
another MAN, which has been standardized as IEEE 802.16. We will look at this area in Chap. 2. 


1.2.3 Wide Area Networks 


A wide area network, or WAN, spans a large geographical area, often a country or continent. It contains a 
collection of machines intended for running user (i.e., application) programs. We will follow traditional usage and 
call these machines hosts. The hosts are connected by a communication subnet, or just subnet for short. The 
hosts are owned by the customers (e.g., people's personal computers), whereas the communication subnet is 
typically owned and operated by a telephone company or Internet service provider. The job of the subnet is to 
carry messages from host to host, just as the telephone system carries words from speaker to listener.
Separation of the pure communication aspects of the network (the subnet) from the application aspects (the
hosts) greatly simplifies the complete network design.


In most wide area networks, the subnet consists of two distinct components: transmission lines and switching 
elements. Transmission lines move bits between machines. They can be made of copper wire, optical fiber, or 
even radio links. Switching elements are specialized computers that connect three or more transmission lines. 
When data arrive on an incoming line, the switching element must choose an outgoing line on which to forward 
them. These switching computers have been called by various names in the past; the name router is now most 
commonly used. Unfortunately, some people pronounce it "rooter" and others have it rhyme with "doubter." 
Determining the correct pronunciation will be left as an exercise for the reader. (Note: the perceived correct 
answer may depend on where you live.) 


In this model, shown in Fig. 1-9, each host is frequently connected to a LAN on which a router is present, 
although in some cases a host can be connected directly to a router. The collection of communication lines and 
routers (but not the hosts) forms the subnet.


Figure 1-9. Relation between hosts on LANs and the subnet. 
A short comment about the term "subnet" is in order here. Originally, its only meaning was the collection of
routers and communication lines that moved packets from the source host to the destination host. However, 
some years later, it also acquired a second meaning in conjunction with network addressing (which we will 
discuss in Chap. 5). Unfortunately, no widely-used alternative exists for its initial meaning, so with some 
hesitation we will use it in both senses. From the context, it will always be clear which is meant. 


In most WANs, the network contains numerous transmission lines, each one connecting a pair of routers. If two 
routers that do not share a transmission line wish to communicate, they must do this indirectly, via other routers. 
When a packet is sent from one router to another via one or more intermediate routers, the packet is received at 
each intermediate router in its entirety, stored there until the required output line is free, and then forwarded. A 
subnet organized according to this principle is called a store-and-forward or packet-switched subnet. Nearly all 
wide area networks (except those using satellites) have store-and-forward subnets. When the packets are small 
and all the same size, they are often called cells. 
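
Because every intermediate router must receive a packet in its entirety before forwarding it, the delay grows with the number of lines crossed. The sketch below is a back-of-the-envelope model under stated assumptions: all lines run at the same speed, all packets are the same size, and propagation delay and queueing are ignored. The function and parameter names are hypothetical.

```python
def store_and_forward_delay(packet_bits, rate_bps, num_links, num_packets=1):
    """Seconds until the last of num_packets equal-sized packets reaches the far end.

    The first packet needs num_links transmission times, one per line, because
    each router stores it completely before sending it on. Every further
    packet follows one transmission time behind its predecessor.
    """
    per_hop = packet_bits / rate_bps
    return (num_links + num_packets - 1) * per_hop
```

For example, a single 1000-bit packet crossing three 1-Mbps lines (that is, passing through two intermediate routers) takes about 3 msec in this model, and four such packets sent back to back take about 6 msec.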


The principle of a packet-switched WAN is so important that it is worth devoting a few more words to it. 
Generally, when a process on some host has a message to be sent to a process on some other host, the 
sending host first cuts the message into packets, each one bearing its number in the sequence. These packets 
are then injected into the network one at a time in quick succession. The packets are transported individually
over the network and deposited at the receiving host, where they are reassembled into the original message and 
delivered to the receiving process. A stream of packets resulting from some initial message is illustrated in Fig. 
1-10. 
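
The cutting and reassembly steps can be sketched as follows. This is a minimal illustration, assuming a packet is just a pair (sequence number, chunk of bytes); real packets carry more control information than this.

```python
def packetize(message, size):
    """Cut message (bytes) into packets, each bearing its number in the sequence."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]


def reassemble(packets):
    """Rebuild the original message, whatever order the packets arrived in."""
    in_order = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in in_order)


packets = packetize(b"I like rabbits", 4)
# Even if the network delivers the packets in reverse order,
# the sequence numbers let the receiving host restore the message.
restored = reassemble(list(reversed(packets)))
```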


Figure 1-10. A stream of packets from sender to receiver. 
In this figure, all the packets follow the route ACE, rather than ABDE or ACDE. In some networks all packets
from a given message must follow the same route; in others each packet is routed separately. Of course, if ACE 
is the best route, all packets may be sent along it, even if each packet is individually routed. 


Routing decisions are made locally. When a packet arrives at router A, it is up to A to decide if this packet should
be sent on the line to B or the line to C. How A makes that decision is called the routing algorithm. Many of them 
exist. We will study some of them in detail in Chap. 5. 
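
To make the idea of a local decision concrete, here is a sketch in which each router picks the neighbor that is the fewest hops from the destination. Both the topology and the fewest-hops rule are assumptions made for illustration: the links are merely one layout consistent with the routes ACE, ABDE, and ACDE mentioned above, and fewest hops is a stand-in for whatever routing algorithm a real network uses.

```python
from collections import deque

# Hypothetical topology: router -> routers it shares a transmission line with.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C", "E"],
    "E": ["C", "D"],
}


def next_hop(router, dest, links=LINKS):
    """The decision made locally at router: which line to use toward dest."""
    if router == dest:
        return None
    # Breadth-first search outward from dest gives every router's hop count to it.
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return min(links[router], key=lambda n: dist.get(n, float("inf")))


def route(src, dest, links=LINKS):
    """Follow the per-router decisions; assumes dest is reachable from src."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest, links))
    return path
```

In this sketch, `route("A", "E")` yields `["A", "C", "E"]`: A chooses the line to C rather than the line to B, and C then forwards directly to E.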


Not all WANs are packet switched. A second possibility for a WAN is a satellite system. Each router has an 
antenna through which it can send and receive. All routers can hear the output from the satellite, and in some 
cases they can also hear the upward transmissions of their fellow routers to the satellite as well. Sometimes the 
routers are connected to a substantial point-to-point subnet, with only some of them having a satellite antenna. 
Satellite networks are inherently broadcast and are most useful when the broadcast property is important. 


1.2.4 Wireless Networks 


Digital wireless communication is not a new idea. As early as 1901, the Italian physicist Guglielmo Marconi 
demonstrated a ship-to-shore wireless telegraph, using Morse Code (dots and dashes are binary, after all). 
Modern digital wireless systems have better performance, but the basic idea is the same. 


To a first approximation, wireless networks can be divided into three main categories: 


1. System interconnection. 
2. Wireless LANs. 
3. Wireless WANs. 


System interconnection is all about interconnecting the components of a computer using short-range radio. 
Almost every computer has a monitor, keyboard, mouse, and printer connected to the main unit by cables. So 
many new users have a hard time plugging all the cables into the right little holes (even though they are usually 
color coded) that most computer vendors offer the option of sending a technician to the user's home to do it. 
Consequently, some companies got together to design a short-range wireless network called Bluetooth to 
connect these components without wires. Bluetooth also allows digital cameras, headsets, scanners, and other 
devices to connect to a computer by merely being brought within range. No cables, no driver installation, just put 
them down, turn them on, and they work. For many people, this ease of operation is a big plus. 


In the simplest form, system interconnection networks use the master-slave paradigm of Fig. 1-11(a). The 
System unit is normally the master, talking to the mouse, keyboard, etc., as slaves. The master tells the slaves 
what addresses to use, when they can broadcast, how long they can transmit, what frequencies they can use, 
and so on. We will discuss Bluetooth in more detail in Chap. 4. 
Figure 1-11. (a) Bluetooth configuration. (b) Wireless LAN.
The next step up in wireless networking are the wireless LANs. These are systems in which every computer has
a radio modem and antenna with which it can communicate with other systems. Often there is an antenna on the 
ceiling that the machines talk to, as shown in Fig. 1-11(b). However, if the systems are close enough, they can 
communicate directly with one another in a peer-to-peer configuration. Wireless LANs are becoming increasingly 
common in small offices and homes, where installing Ethernet is considered too much trouble, as well as in older 
office buildings, company cafeterias, conference rooms, and other places. There is a standard for wireless LANs, 
called IEEE 802.11, which most systems implement and which is becoming very widespread. We will discuss it
in Chap. 4.


The third kind of wireless network is used in wide area systems. The radio network used for cellular telephones 
is an example of a low-bandwidth wireless system. This system has already gone through three generations. 
The first generation was analog and for voice only. The second generation was digital and for voice only. The 
third generation is digital and is for both voice and data. In a certain sense, cellular wireless networks are like 
wireless LANs, except that the distances involved are much greater and the bit rates much lower. Wireless LANs 
can operate at rates up to about 50 Mbps over distances of tens of meters. Cellular systems operate below 1 
Mbps, but the distance between the base station and the computer or telephone is measured in kilometers 
rather than in meters. We will have a lot to say about these networks in Chap. 2. 


In addition to these low-speed networks, high-bandwidth wide area wireless networks are also being developed. 
The initial focus is high-speed wireless Internet access from homes and businesses, bypassing the telephone 
system. This service is often called local multipoint distribution service. We will study it later in the book. A 
standard for it, called IEEE 802.16, has also been developed. We will examine the standard in Chap. 4. 


Almost all wireless networks hook up to the wired network at some point to provide access to files, databases, 
and the Internet. There are many ways these connections can be realized, depending on the circumstances. For 
example, in Fig. 1-12(a), we depict an airplane with a number of people using modems and seat-back 
telephones to call the office. Each call is independent of the other ones. A much more efficient option, however, 
is the flying LAN of Fig. 1-12(b). Here each seat comes equipped with an Ethernet connector into which 
passengers can plug their computers. A single router on the aircraft maintains a radio link with some router on 
the ground, changing routers as it flies along. This configuration is just a traditional LAN, except that its 
connection to the outside world happens to be a radio link instead of a hardwired line. 


Figure 1-12. (a) Individual mobile computers. (b) A flying LAN. 
Many people believe wireless is the wave of the future (e.g., Bi et al., 2001; Leeper, 2001; Varshney and Vetter,
2000) but at least one dissenting voice has been heard. Bob Metcalfe, the inventor of Ethernet, has written:
"Mobile wireless computers are like mobile pipeless bathrooms—portapotties. They will be common on vehicles, 
and at construction sites, and rock concerts. My advice is to wire up your home and stay there" (Metcalfe, 1995). 
History may record this remark in the same category as IBM's chairman T.J. Watson's 1945 explanation of why 
IBM was not getting into the computer business: "Four or five computers should be enough for the entire world 
until the year 2000." 


1.2.5 Home Networks 


Home networking is on the horizon. The fundamental idea is that in the future most homes will be set up for 
networking. Every device in the home will be capable of communicating with every other device, and all of them 
will be accessible over the Internet. This is one of those visionary concepts that nobody asked for (like TV 
remote controls or mobile phones), but once they arrived nobody can imagine how they lived without them. 


Many devices are capable of being networked. Some of the more obvious categories (with examples) are as 
follows: 


Computers (desktop PC, notebook PC, PDA, shared peripherals). 
Entertainment (TV, DVD, VCR, camcorder, camera, stereo, MP3). 
Telecommunications (telephone, mobile telephone, intercom, fax). 
Appliances (microwave, refrigerator, clock, furnace, airco, lights). 
Telemetry (utility meter, smoke/burglar alarm, thermostat, babycam). 


Home computer networking is already here in a limited way. Many homes already have a device to connect 
multiple computers to a fast Internet connection. Networked entertainment is not quite here, but as more and 
more music and movies can be downloaded from the Internet, there will be a demand to connect stereos and 
televisions to it. Also, people will want to share their own videos with friends and family, so the connection will 
need to go both ways. Telecommunications gear is already connected to the outside world, but soon it will be 
digital and go over the Internet. The average home probably has a dozen clocks (e.g., in appliances), all of 
which have to be reset twice a year when daylight saving time (summer time) comes and goes. If all the clocks 
were on the Internet, that resetting could be done automatically. Finally, remote monitoring of the home and its 
contents is a likely winner. Probably many parents would be willing to spend some money to monitor their 
sleeping babies on their PDAs when they are eating out, even with a rented teenager in the house. While one 
can imagine a separate network for each application area, integrating all of them into a single network is 
probably a better idea. 


Home networking has some fundamentally different properties than other network types. First, the network and 
devices have to be easy to install. The author has installed numerous pieces of hardware and software on 
various computers over the years, with mixed results. A series of phone calls to the vendor's helpdesk typically 
resulted in answers like (1) Read the manual, (2) Reboot the computer, (3) Remove all hardware and software 
except ours and try again, (4) Download the newest driver from our Web site, and if all else fails, (5) Reformat 
the hard disk and then reinstall Windows from the CD-ROM. Telling the purchaser of an Internet refrigerator to 
download and install a new version of the refrigerator's operating system is not going to lead to happy 
customers. Computer users are accustomed to putting up with products that do not work; the car-, television-, 
and refrigerator-buying public is far less tolerant. They expect products to work for 100% from the word go.


Second, the network and devices have to be foolproof in operation. Air conditioners used to have one knob with 
four settings: OFF, LOW, MEDIUM, and HIGH. Now they have 30-page manuals. Once they are networked, 
expect the chapter on security alone to be 30 pages. This will be beyond the comprehension of virtually all the 
users. 


Third, low price is essential for success. People will not pay a $50 premium for an Internet thermostat because 
few people regard monitoring their home temperature from work that important. For $5 extra, it might sell, 
though. 


Fourth, the main application is likely to involve multimedia, so the network needs sufficient capacity. There is no 
market for Internet-connected televisions that show shaky movies at 320 x 240 pixel resolution and 10 
frames/sec. Fast Ethernet, the workhorse in most offices, is not good enough for multimedia. Consequently, 
home networks will need better performance than that of existing office networks and at lower prices before they 
become mass market items. 


Fifth, it must be possible to start out with one or two devices and expand the reach of the network gradually. This 
means no format wars. Telling consumers to buy peripherals with IEEE 1394 (FireWire) interfaces and a few 
years later retracting that and saying USB 2.0 is the interface-of-the-month is going to make consumers skittish. 
The network interface will have to remain stable for many years; the wiring (if any) will have to remain stable for 
decades. 


Sixth, security and reliability will be very important. Losing a few files to an e-mail virus is one thing; having a 
burglar disarm your security system from his PDA and then plunder your house is something quite different. 


An interesting question is whether home networks will be wired or wireless. Most homes already have six 
networks installed: electricity, telephone, cable television, water, gas, and sewer. Adding a seventh one during 
construction is not difficult, but retrofitting existing houses is expensive. Cost favors wireless networking, but 
security favors wired networking. The problem with wireless networks is that the radio waves they use are quite good at
going through fences. Not everyone is overjoyed at the thought of having the neighbors piggybacking on their 
Internet connection and reading their e-mail on its way to the printer. In Chap. 8 we will study how encryption 
can be used to provide security, but in the context of a home network, security has to be foolproof, even with 
inexperienced users. This is easier said than done, even with highly sophisticated users. 


In short, home networking offers many opportunities and challenges. Most of them relate to the need to be easy 
to manage, dependable, and secure, especially in the hands of nontechnical users, while at the same time 
delivering high performance at low cost. 


1.2.6 Internetworks 


Many networks exist in the world, often with different hardware and software. People connected to one network 
often want to communicate with people attached to a different one. The fulfillment of this desire requires that 
different, and frequently incompatible, networks be connected, sometimes by means of machines called
gateways to make the connection and provide the necessary translation, both in terms of hardware and 
software. A collection of interconnected networks is called an internetwork or internet. These terms will be used 
in a generic sense, in contrast to the worldwide Internet (which is one specific internet), which we will always 
capitalize. 


A common form of internet is a collection of LANs connected by a WAN. In fact, if we were to replace the label 
"subnet" in Fig. 1-9 by "WAN," nothing else in the figure would have to change. The only real technical 
distinction between a subnet and a WAN in this case is whether hosts are present. If the system within the gray 
area contains only routers, it is a subnet; if it contains both routers and hosts, it is a WAN. The real differences 
relate to ownership and use. 


Subnets, networks, and internetworks are often confused. Subnet makes the most sense in the context of a wide 
area network, where it refers to the collection of routers and communication lines owned by the network 
operator. As an analogy, the telephone system consists of telephone switching offices connected to one another 
by high-speed lines, and to houses and businesses by low-speed lines. These lines and equipment, owned and 
managed by the telephone company, form the subnet of the telephone system. The telephones themselves (the
hosts in this analogy) are not part of the subnet. The combination of a subnet and its hosts forms a network. In
the case of a LAN, the cable and the hosts form the network. There really is no subnet. 


An internetwork is formed when distinct networks are interconnected. In our view, connecting a LAN and a WAN 
or connecting two LANs forms an internetwork, but there is little agreement in the industry over terminology in 
this area. One rule of thumb is that if different organizations paid to construct different parts of the network and 
each maintains its part, we have an internetwork rather than a single network. Also, if the underlying technology 
is different in different parts (e.g., broadcast versus point-to-point), we probably have two networks. 


1.3 Network Software 


The first computer networks were designed with the hardware as the main concern and the software as an 
afterthought. This strategy no longer works. Network software is now highly structured. In the following sections 
we examine the software structuring technique in some detail. The method described here forms the keystone of 
the entire book and will occur repeatedly later on. 


1.3.1 Protocol Hierarchies 


To reduce their design complexity, most networks are organized as a stack of layers or levels, each one built 
upon the one below it. The number of layers, the name of each layer, the contents of each layer, and the 
function of each layer differ from network to network. The purpose of each layer is to offer certain services to the 
higher layers, shielding those layers from the details of how the offered services are actually implemented. In a 
sense, each layer is a kind of virtual machine, offering certain services to the layer above it. 


This concept is actually a familiar one and used throughout computer science, where it is variously known as 
information hiding, abstract data types, data encapsulation, and object-oriented programming. The fundamental 
idea is that a particular piece of software (or hardware) provides a service to its users but keeps the details of its 
internal state and algorithms hidden from them. 


Layer n on one machine carries on a conversation with layer n on another machine. The rules and conventions 
used in this conversation are collectively known as the layer n protocol. Basically, a protocol is an agreement 
between the communicating parties on how communication is to proceed. As an analogy, when a woman is 
introduced to a man, she may choose to stick out her hand. He, in turn, may decide either to shake it or kiss it, 
depending, for example, on whether she is an American lawyer at a business meeting or a European princess at 
a formal ball. Violating the protocol will make communication more difficult, if not completely impossible. 


A five-layer network is illustrated in Fig. 1-13. The entities comprising the corresponding layers on different 
machines are called peers. The peers may be processes, hardware devices, or even human beings. In other 
words, it is the peers that communicate by using the protocol.

Figure 1-13. Layers, protocols, and interfaces. 
In reality, no data are directly transferred from layer n on one machine to layer n on another machine. Instead,
each layer passes data and control information to the layer immediately below it, until the lowest layer is 
reached. Below layer 1 is the physical medium through which actual communication occurs. 
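
The downward passage of data can be sketched with strings. The sketch assumes, purely for illustration, that each layer's control information is a short text header of the form `Hn|` placed in front of whatever it received from above; the function names are hypothetical.

```python
def send_down(message, num_layers):
    """Pass message from the top layer down to layer 1 on the sending machine."""
    data = message
    for layer in range(num_layers, 0, -1):
        data = f"H{layer}|" + data       # each layer adds its own control information
    return data                          # this is what crosses the physical medium


def pass_up(data, num_layers):
    """Pass received data from layer 1 up to the top layer on the receiving machine."""
    for layer in range(1, num_layers + 1):
        header, data = data.split("|", 1)
        if header != f"H{layer}":
            raise ValueError(f"layer {layer} cannot interpret header {header!r}")
    return data                          # the top layer sees only the original message
```

With three layers, `send_down("M", 3)` produces `"H1|H2|H3|M"`. Layer n on the receiver reads only the header written by layer n on the sender, which is the sense in which the two peers converse even though the data physically travel down one stack and up the other.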


In Fig. 1-13, virtual communication is shown by dotted lines and physical communication by solid lines. 


Between each pair of adjacent layers is an interface. The interface defines which primitive operations and 
services the lower layer makes available to the upper one. When network designers decide how many layers to 
include in a network and what each one should do, one of the most important considerations is defining clean 
interfaces between the layers. Doing so, in turn, requires that each layer perform a specific collection of well- 
understood functions. In addition to minimizing the amount of information that must be passed between layers, 
clear-cut interfaces also make it simpler to replace the implementation of one layer with a completely different 
implementation (e.g., all the telephone lines are replaced by satellite channels) because all that is required of the 
new implementation is that it offer exactly the same set of services to its upstairs neighbor as the old 
implementation did. In fact, it is common that different hosts use different implementations. 
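
The value of a clean interface can be shown with a small sketch in which an upper layer runs unchanged over two different lower-layer implementations. All class and method names here are hypothetical, and the "channels" are in-memory queues standing in for real hardware.

```python
from abc import ABC, abstractmethod
from collections import deque


class LowerLayer(ABC):
    """The interface: the only operations the upper layer may rely on."""

    @abstractmethod
    def send(self, data: bytes) -> None: ...

    @abstractmethod
    def receive(self) -> bytes: ...


class TelephoneLine(LowerLayer):
    def __init__(self):
        self._wire = deque()

    def send(self, data):
        self._wire.append(data)

    def receive(self):
        return self._wire.popleft()


class SatelliteChannel(LowerLayer):
    """A completely different implementation offering exactly the same services."""

    def __init__(self):
        self._in_orbit = []

    def send(self, data):
        self._in_orbit.append(data[::-1])     # internal detail: stored reversed

    def receive(self):
        return self._in_orbit.pop(0)[::-1]    # hidden from the layer above


class UpperLayer:
    def __init__(self, lower: LowerLayer):
        self.lower = lower

    def say(self, text):
        self.lower.send(text.encode("utf-8"))

    def hear(self):
        return self.lower.receive().decode("utf-8")
```

Constructing `UpperLayer(TelephoneLine())` or `UpperLayer(SatelliteChannel())` makes no difference to the upper layer's code, since it depends only on `send` and `receive`.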


A set of layers and protocols is called a network architecture. The specification of an architecture must contain 
enough information to allow an implementer to write the program or build the hardware for each layer so that it 
will correctly obey the appropriate protocol. Neither the details of the implementation nor the specification of the 
interfaces is part of the architecture because these are hidden away inside the machines and not visible from the 
outside. It is not even necessary that the interfaces on all machines in a network be the same, provided that 
each machine can correctly use all the protocols. A list of protocols used by a certain system, one protocol per 
layer, is called a protocol stack. The subjects of network architectures, protocol stacks, and the protocols 
themselves are the principal topics of this book. 


An analogy may help explain the idea of multilayer communication. Imagine two philosophers (peer processes in 
layer 3), one of whom speaks Urdu and English and one of whom speaks Chinese and French. Since they have 
no common language, they each engage a translator (peer processes at layer 2), each of whom in turn contacts 
a secretary (peer processes in layer 1). Philosopher 1 wishes to convey his affection for oryctolagus cuniculus to 
his peer. To do so, he passes a message (in English) across the 2/3 interface to his translator, saying "I like 
rabbits," as illustrated in Fig. 1-14. The translators have agreed on a neutral language known to both of them, 
Dutch, so the message is converted to "Ik vind konijnen leuk." The choice of language is the layer 2 protocol and 
is up to the layer 2 peer processes. 


Figure 1-14. The philosopher-translator-secretary architecture.


The translator then gives the message to a secretary for transmission, by, for example, fax (the layer 1 protocol).
When the message arrives, it is translated into French and passed across the 2/3 interface to philosopher 2. 
Note that each protocol is completely independent of the other ones as long as the interfaces are not changed. 
The translators can switch from Dutch to, say, Finnish, at will, provided that they both agree, and neither changes
his interface with either layer 1 or layer 3. Similarly, the secretaries can switch from fax to e-mail or telephone 
without disturbing (or even informing) the other layers. Each process may add some information intended only 
for its peer. This information is not passed upward to the layer above. 


Now consider a more technical example: how to provide communication to the top layer of the five-layer network 
in Fig. 1-15. A message, M, is produced by an application process running in layer 5 and given to layer 4 for 
transmission. Layer 4 puts a header in front of the message to identify the message and passes the result to 
layer 3. The header includes control information, such as sequence numbers, to allow layer 4 on the destination 
machine to deliver messages in the right order if the lower layers do not maintain sequence. In some layers, 
headers can also contain sizes, times, and other control fields. 


Figure 1-15. Example information flow supporting virtual communication in layer 5.


In many networks, there is no limit to the size of messages transmitted in the layer 4 protocol, but there is nearly
always a limit imposed by the layer 3 protocol. Consequently, layer 3 must break up the incoming messages into
smaller units, packets, prepending a layer 3 header to each packet. In this example, M is split into two parts, M1
and M2.


Layer 3 decides which of the outgoing lines to use and passes the packets to layer 2. Layer 2 adds not only a 
header to each piece, but also a trailer, and gives the resulting unit to layer 1 for physical transmission. At the 
receiving machine the message moves upward, from layer to layer, with headers being stripped off as it 
progresses. None of the headers for layers below n are passed up to layer n. 
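
The downward and upward flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header and trailer formats (H4, H3, H2, T2) and the layer 3 size limit MAX_L3_PAYLOAD are invented here and do not correspond to any real protocol.

```python
# Hypothetical sketch of headers being added on the way down and stripped
# on the way up. All formats and the size limit are invented for illustration.

MAX_L3_PAYLOAD = 8  # assumed layer 3 limit on payload size, in bytes


def layer4_send(message: bytes, seq: int) -> bytes:
    # Layer 4 prepends a header carrying a sequence number.
    return b"H4:%d|" % seq + message


def layer3_send(segment: bytes) -> list:
    # Layer 3 splits the unit into packets and prepends a header to each.
    parts = [segment[i:i + MAX_L3_PAYLOAD]
             for i in range(0, len(segment), MAX_L3_PAYLOAD)]
    return [b"H3:%d|" % i + part for i, part in enumerate(parts)]


def layer2_send(packet: bytes) -> bytes:
    # Layer 2 adds both a header and a trailer.
    return b"H2|" + packet + b"|T2"


def layer2_receive(frame: bytes) -> bytes:
    # Strip the layer 2 header and trailer; layer 3 never sees them.
    assert frame.startswith(b"H2|") and frame.endswith(b"|T2")
    return frame[3:-3]


def layer3_receive(packets: list) -> bytes:
    # Strip each layer 3 header and rejoin the pieces in index order.
    indexed = []
    for packet in packets:
        header, _, payload = packet.partition(b"|")
        indexed.append((int(header[3:]), payload))
    return b"".join(payload for _, payload in sorted(indexed))


def layer4_receive(segment: bytes) -> tuple:
    # Strip the layer 4 header; return (sequence number, original message).
    header, _, message = segment.partition(b"|")
    return int(header[3:]), message


if __name__ == "__main__":
    frames = [layer2_send(p)
              for p in layer3_send(layer4_send(b"I like rabbits", seq=1))]
    packets = [layer2_receive(f) for f in frames]
    print(layer4_receive(layer3_receive(packets)))  # (1, b'I like rabbits')
```

Each receive function sees only the header written by its own peer, which is the sense in which the communication between peers is virtual.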


The important thing to understand about Fig. 1-15 is the relation between the virtual and actual communication 
and the difference between protocols and interfaces. The peer processes in layer 4, for example, conceptually 
think of their communication as being "horizontal," using the layer 4 protocol. Each one is likely to have
procedures called something like SendToOtherSide and GetFromOtherSide, even though these procedures
actually communicate with lower layers across the 3/4 interface, not with the other side. 


The peer process abstraction is crucial to all network design. Using it, the unmanageable task of designing the 
complete network can be broken into several smaller, manageable design problems, namely, the design of the 
individual layers. 


Although Sec. 1.3 is called "Network Software," it is worth pointing out that the lower layers of a protocol hierarchy are
frequently implemented in hardware or firmware. Nevertheless, complex protocol algorithms are involved, even if
they are embedded (in whole or in part) in hardware. 


1.3.2 Design Issues for the Layers 


Some of the key design issues that occur in computer networks are present in several layers. Below, we will 
briefly mention some of the more important ones. 


Every layer needs a mechanism for identifying senders and receivers. Since a network normally has many 
computers, some of which have multiple processes, a means is needed for a process on one machine to specify 
with whom it wants to talk. As a consequence of having multiple destinations, some form of addressing is 
needed in order to specify a specific destination. 


Another set of design decisions concerns the rules for data transfer. In some systems, data only travel in one 
direction; in others, data can go both ways. The protocol must also determine how many logical channels the
connection corresponds to and what their priorities are. Many networks provide at least two logical channels per
connection, one for normal data and one for urgent data. 


Error control is an important issue because physical communication circuits are not perfect. Many error-detecting 
and error-correcting codes are known, but both ends of the connection must agree on which one is being used. 
In addition, the receiver must have some way of telling the sender which messages have been correctly received 
and which have not. 
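
As a minimal sketch of error detection, the fragment below assumes that both ends have agreed on CRC-32 (taken here from Python's standard zlib module) and that the checksum travels as a 4-byte trailer. The frame layout is an assumption made for this example; the boolean returned by check_frame stands in for the feedback the receiver would give the sender.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    # Sender: append a 4-byte CRC-32 so the receiver can detect damage.
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> tuple:
    # Receiver: return (ok, payload); ok is False if the checksum does not match.
    payload, received_crc = frame[:-4], frame[-4:]
    ok = zlib.crc32(payload).to_bytes(4, "big") == received_crc
    return ok, payload
```

A frame with one bit flipped in transit fails the check, so the receiver knows to report it as not correctly received.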


Not all communication channels preserve the order of messages sent on them. To deal with a possible loss of 
sequencing, the protocol must make explicit provision for the receiver to allow the pieces to be reassembled 
properly. An obvious solution is to number the pieces, but this solution still leaves open the question of what 
should be done with pieces that arrive out of order. 
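
One possible answer to that question is to hold early arrivals in a buffer until the missing pieces show up. The sketch below assumes sequence numbers start at 0 and that no piece is lost or duplicated; discarding out-of-order pieces and asking for retransmission would be another valid policy.

```python
def deliver_in_order(arrivals):
    """Yield payloads in sequence order, buffering pieces that arrive early.

    arrivals is an iterable of (sequence_number, payload) pairs in the
    order in which they came off the channel.
    """
    buffer = {}
    next_seq = 0
    for seq, payload in arrivals:
        buffer[seq] = payload
        # Release every piece that is now contiguous with what was delivered.
        while next_seq in buffer:
            yield buffer.pop(next_seq)
            next_seq += 1
```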


An issue that occurs at every level is how to keep a fast sender from swamping a slow receiver with data. 
Various solutions have been proposed and will be discussed later. Some of them involve some kind of feedback 
from the receiver to the sender, either directly or indirectly, about the receiver's current situation. Others limit the 
sender to an agreed-on transmission rate. This subject is called flow control. 
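
The feedback approach can be illustrated with a toy model in which the receiver reports how many buffer slots it has free and the sender transmits only while that number is nonzero. The class and function names are hypothetical, and the call to consume() inside send_all is merely a stand-in for the slow receiver eventually catching up.

```python
from collections import deque


class Receiver:
    def __init__(self, slots: int):
        self.slots = slots
        self.queue = deque()

    def credit(self) -> int:
        # Feedback to the sender: how many more messages fit right now.
        return self.slots - len(self.queue)

    def accept(self, message) -> None:
        if self.credit() == 0:
            raise OverflowError("receiver swamped")
        self.queue.append(message)

    def consume(self):
        # The receiving process takes one message out of the buffer.
        return self.queue.popleft()


def send_all(messages, receiver: Receiver) -> int:
    """Send every message, pausing whenever the receiver reports no credit.

    Returns the number of times the sender had to wait.
    """
    waits = 0
    for message in messages:
        if receiver.credit() == 0:
            waits += 1
            receiver.consume()  # stand-in for waiting until space frees up
        receiver.accept(message)
    return waits
```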


Another problem that must be solved at several levels is the inability of all processes to accept arbitrarily long 
messages. This property leads to mechanisms for disassembling, transmitting, and then reassembling 
messages. A related issue is the problem of what to do when processes insist on transmitting data in units that 
are so small that sending each one separately is inefficient. Here the solution is to gather several small 
messages heading toward a common destination into a single large message and dismember the large 
message at the other side. 
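
Gathering small messages into one large one requires some way to find the boundaries again at the far side. A minimal sketch, assuming each small message is preceded by a 2-byte length field in network byte order (a framing choice made up for this example):

```python
import struct


def bundle(messages: list) -> bytes:
    # Prefix each small message with its length and concatenate the results.
    return b"".join(struct.pack("!H", len(m)) + m for m in messages)


def unbundle(blob: bytes) -> list:
    # Walk the large message, using each length field to cut out one message.
    messages, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from("!H", blob, offset)
        offset += 2
        messages.append(blob[offset:offset + length])
        offset += length
    return messages
```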


When it is inconvenient or expensive to set up a separate connection for each pair of communicating processes, 
the underlying layer may decide to use the same connection for multiple, unrelated conversations. As long as 
this multiplexing and demultiplexing is done transparently, it can be used by any layer. Multiplexing is needed in 
the physical layer, for example, where all the traffic for all connections has to be sent over at most a few physical 
circuits. 
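
A simple way to picture multiplexing is to tag every unit with the identity of the conversation it belongs to before placing it on the shared connection. The sketch below interleaves conversations round-robin and separates them again at the other end; the conversation identifiers and the round-robin policy are assumptions of this example, and units are assumed never to be None.

```python
from collections import defaultdict
from itertools import zip_longest


def multiplex(conversations):
    # conversations maps a conversation id to its list of outgoing units.
    shared = []
    for round_ in zip_longest(*conversations.values()):
        for conv_id, unit in zip(conversations, round_):
            if unit is not None:  # this conversation has nothing left to send
                shared.append((conv_id, unit))
    return shared


def demultiplex(shared):
    # Use the tag on each unit to hand it back to the right conversation.
    separated = defaultdict(list)
    for conv_id, unit in shared:
        separated[conv_id].append(unit)
    return dict(separated)
```

Because the tags are added and removed inside the layer, the users of the individual conversations never see them.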


When there are multiple paths between source and destination, a route must be chosen. Sometimes this 
decision must be split over two or more layers. For example, to send data from London to Rome, a high-level 
decision might have to be made to transit France or Germany based on their respective privacy laws. Then a 
low-level decision might have to be made to select one of the available circuits based on the current traffic load.
This topic is called routing.


1.3.3 Connection-Oriented and Connectionless Services


Layers can offer two different types of service to the layers above them: connection-oriented and connectionless. 
In this section we will look at these two types and examine the differences between them. 


Connection-oriented service is modeled after the telephone system. To talk to someone, you pick up the phone, 
dial the number, talk, and then hang up. Similarly, to use a connection-oriented network service, the service user 
first establishes a connection, uses the connection, and then releases the connection. The essential aspect of a 
connection is that it acts like a tube: the sender pushes objects (bits) in at one end, and the receiver takes them 
out at the other end. In most cases the order is preserved so that the bits arrive in the order they were sent. 


In some cases when a connection is established, the sender, receiver, and subnet conduct a negotiation about 
parameters to be used, such as maximum message size, quality of service required, and other issues. Typically, 
one side makes a proposal and the other side can accept it, reject it, or make a counterproposal. 


In contrast, connectionless service is modeled after the postal system. Each message (letter) carries the full
destination address, and each one is routed through the system independent of all the others. Normally, when 
two messages are sent to the same destination, the first one sent will be the first one to arrive. However, it is 
possible that the first one sent can be delayed so that the second one arrives first. 


Each service can be characterized by a quality of service. Some services are reliable in the sense that they 
never lose data. Usually, a reliable service is implemented by having the receiver acknowledge the receipt of 
each message so the sender is sure that it arrived. The acknowledgement process introduces overhead and 
delays, which are often worth it but are sometimes undesirable. 


A typical situation in which a reliable connection-oriented service is appropriate is file transfer. The owner of the 
file wants to be sure that all the bits arrive correctly and in the same order they were sent. Very few file transfer 
customers would prefer a service that occasionally scrambles or loses a few bits, even if it is much faster. 


Reliable connection-oriented service has two minor variations: message sequences and byte streams. In the 
former variant, the message boundaries are preserved. When two 1024-byte messages are sent, they arrive as 
two distinct 1024-byte messages, never as one 2048-byte message. In the latter, the connection is simply a 
stream of bytes, with no message boundaries. When 2048 bytes arrive at the receiver, there is no way to tell if 
they were sent as one 2048-byte message, two 1024-byte messages, or 2048 1-byte messages. If the pages of 
a book are sent over a network to a phototypesetter as separate messages, it might be important to preserve the 
message boundaries. On the other hand, when a user logs into a remote server, a byte stream from the user's 
computer to the server is all that is needed. Message boundaries are not relevant. 
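
The difference between the two variants shows up clearly in a toy in-memory model. The two classes below are hypothetical and ignore the network entirely; they only contrast what the receiver can and cannot learn about how the data were sent.

```python
from collections import deque


class MessageSequence:
    """Reliable message stream: boundaries between messages are preserved."""

    def __init__(self):
        self._messages = deque()

    def send(self, message: bytes) -> None:
        self._messages.append(message)

    def receive(self) -> bytes:
        # Always returns exactly one message, as it was sent.
        return self._messages.popleft()


class ByteStream:
    """Reliable byte stream: only the order of the bytes is preserved."""

    def __init__(self):
        self._bytes = bytearray()

    def send(self, data: bytes) -> None:
        self._bytes += data

    def receive(self, max_bytes: int) -> bytes:
        # Returns up to max_bytes bytes, regardless of how they were sent.
        chunk = bytes(self._bytes[:max_bytes])
        del self._bytes[:max_bytes]
        return chunk
```

After two 1024-byte sends, MessageSequence.receive() returns 1024 bytes twice, whereas ByteStream.receive(2048) returns all 2048 bytes at once with nothing to show where the first send ended.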


As mentioned above, for some applications, the transit delays introduced by acknowledgements are 
unacceptable. One such application is digitized voice traffic. It is preferable for telephone users to hear a bit of 
noise on the line from time to time than to experience a delay waiting for acknowledgements. Similarly, when 
transmitting a video conference, having a few pixels wrong is no problem, but having the image jerk along as the 
flow stops to correct errors is irritating. 


Not all applications require connections. For example, as electronic mail becomes more common, electronic junk 
is becoming more common too. The electronic junk-mail sender probably does not want to go to the trouble of 
setting up and later tearing down a connection just to send one item. Nor is 100 percent reliable delivery 
essential, especially if it costs more. All that is needed is a way to send a single message that has a high 
probability of arrival, but no guarantee. Unreliable (meaning not acknowledged) connectionless service is often 
called datagram service, in analogy with telegram service, which also does not return an acknowledgement to 
the sender. 


In other situations, the convenience of not having to establish a connection to send one short message is 
desired, but reliability is essential. The acknowledged datagram service can be provided for these applications. It 
is like sending a registered letter and requesting a return receipt. When the receipt comes back, the sender is 
absolutely sure that the letter was delivered to the intended party and not lost along the way. 


Still another service is the request-reply service. In this service the sender transmits a single datagram
containing a request; the reply contains the answer. For example, a query to the local library asking where
Uighur is spoken falls into this category. Request-reply is commonly used to implement communication in the 
client-server model: the client issues a request and the server responds to it. Figure 1-16 summarizes the types 
of services discussed above. 


Figure 1-16. Six different types of service. 


                      Service                    Example
Connection-oriented   Reliable message stream    Sequence of pages
                      Reliable byte stream       Remote login
                      Unreliable connection      Digitized voice
Connectionless        Unreliable datagram        Electronic junk mail
                      Acknowledged datagram      Registered mail
                      Request-reply              Database query


The concept of using unreliable communication may be confusing at first. After all, why would anyone actually 
prefer unreliable communication to reliable communication? First of all, reliable communication (in our sense, 
that is, acknowledged) may not be available. For example, Ethernet does not provide reliable communication. 
Packets can occasionally be damaged in transit. It is up to higher protocol levels to deal with this problem. 
Second, the delays inherent in providing a reliable service may be unacceptable, especially in real-time 
applications such as multimedia. For these reasons, both reliable and unreliable communication coexist. 


1.3.4 Service Primitives 


A service is formally specified by a set of primitives (operations) available to a user process to access the 
service. These primitives tell the service to perform some action or report on an action taken by a peer entity. If 
the protocol stack is located in the operating system, as it often is, the primitives are normally system calls. 
These calls cause a trap to kernel mode, which then turns control of the machine over to the operating system to 
send the necessary packets. 


The set of primitives available depends on the nature of the service being provided. The primitives for 
connection-oriented service are different from those of connectionless service. As a minimal example of the
service primitives that might be provided to implement a reliable byte stream in a client-server environment,
consider the primitives listed in Fig. 1-17. 


Figure 1-17. Five service primitives for implementing a simple connection-oriented service. 


Primitive     Meaning
LISTEN        Block waiting for an incoming connection
CONNECT       Establish a connection with a waiting peer
RECEIVE       Block waiting for an incoming message
SEND          Send a message to the peer
DISCONNECT    Terminate a connection


These primitives might be used as follows. First, the server executes LISTEN to indicate that it is prepared to 
accept incoming connections. A common way to implement LISTEN is to make it a blocking system call. After 
executing the primitive, the server process is blocked until a request for connection appears. 


Next, the client process executes CONNECT to establish a connection with the server. The CONNECT call 
needs to specify who to connect to, so it might have a parameter giving the server's address. The operating 
system then typically sends a packet to the peer asking it to connect, as shown by (1) in Fig. 1-18. The client 
process is suspended until there is a response. When the packet arrives at the server, it is processed by the
operating system there. When the system sees that the packet is requesting a connection, it checks to see if
there is a listener. If so, it does two things: unblocks the listener and sends back an acknowledgement (2). The 
arrival of this acknowledgement then releases the client. At this point the client and server are both running and 
they have a connection established. It is important to note that the acknowledgement (2) is generated by the 
protocol code itself, not in response to a user-level primitive. If a connection request arrives and there is no 
listener, the result is undefined. In some systems the packet may be queued for a short time in anticipation of a 
LISTEN. 


Figure 1-18. Packets sent in a simple client-server interaction on a connection-oriented network.


The obvious analogy between this protocol and real life is a customer (client) calling a company's customer
service manager. The service manager starts out by being near the telephone in case it rings. Then the client 
places the call. When the manager picks up the phone, the connection is established. 


The next step is for the server to execute RECEIVE to prepare to accept the first request. Normally, the server 
does this immediately upon being released from the LISTEN, before the acknowledgement can get back to the 
client. The RECEIVE call blocks the server. 


Then the client executes SEND to transmit its request (3) followed by the execution of RECEIVE to get the reply. 


The arrival of the request packet at the server machine unblocks the server process so it can process the 
request. After it has done the work, it uses SEND to return the answer to the client (4). The arrival of this packet 
unblocks the client, which can now inspect the answer. If the client has additional requests, it can make them 
now. If it is done, it can use DISCONNECT to terminate the connection. Usually, an initial DISCONNECT is a 
blocking call, suspending the client and sending a packet to the server saying that the connection is no longer 
needed (5). When the server gets the packet, it also issues a DISCONNECT of its own, acknowledging the client 
and releasing the connection. When the server's packet (6) gets back to the client machine, the client process is 
released and the connection is broken. In a nutshell, this is how connection-oriented communication works. 
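
The sequence of packets just described can be traced with a small simulation. It does not use a real network or real blocking calls; it is a sketch, with invented names, that records which packet each primitive would cause to be sent. The handler argument is a hypothetical stand-in for the work the server does on each request.

```python
def run_session(requests, handler):
    """Simulate one connection; return (replies, trace of packets sent)."""
    trace = []
    replies = []

    # Server: LISTEN (blocks until a connection request arrives).
    # Client: CONNECT sends a request and blocks until the ACK arrives.
    trace.append(("client -> server", "connect request"))
    trace.append(("server -> client", "ACK"))

    for request in requests:
        # Server: RECEIVE (blocks). Client: SEND, then RECEIVE (blocks).
        trace.append(("client -> server", "request for data"))
        reply = handler(request)  # server does the work, then executes SEND
        trace.append(("server -> client", "reply"))
        replies.append(reply)

    # Client: DISCONNECT sends a packet and blocks; the server answers
    # with a DISCONNECT of its own, which releases the client.
    trace.append(("client -> server", "disconnect"))
    trace.append(("server -> client", "disconnect"))
    return replies, trace
```

With a single request the trace contains six packets. Each additional request on the same connection adds two more, while the four packets for setup and release are paid only once.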


Of course, life is not so simple. Many things can go wrong here. The timing can be wrong (e.g., the CONNECT is 
done before the LISTEN), packets can get lost, and much more. We will look at these issues in great detail later, 
but for the moment, Fig. 1-18 briefly summarizes how client-server communication might work over a 
connection-oriented network. 


Given that six packets are required to complete this protocol, one might wonder why a connectionless protocol is 
not used instead. The answer is that in a perfect world it could be, in which case only two packets would be 
needed: one for the request and one for the reply. However, in the face of large messages in either direction 
(e.g., a megabyte file), transmission errors, and lost packets, the situation changes. If the reply consisted of 
hundreds of packets, some of which could be lost during transmission, how would the client know if some pieces 
were missing? How would the client know whether the last packet actually received was really the last packet 
sent? Suppose that the client wanted a second file. How could it tell packet 1 from the second file from a lost 
packet 1 from the first file that suddenly found its way to the client? In short, in the real world, a simple request- 
reply protocol over an unreliable network is often inadequate. In Chap. 3 we will study a variety of protocols in 
detail that overcome these and other problems. For the moment, suffice it to say that having a reliable, ordered 
byte stream between processes is sometimes very convenient. 


1.3.5 The Relationship of Services to Protocols 


Services and protocols are distinct concepts, although they are frequently confused. This distinction is so 
important, however, that we emphasize it again here. A service is a set of primitives (operations) that a layer
provides to the layer above it. The service defines what operations the layer is prepared to perform on behalf of
its users, but it says nothing at all about how these operations are implemented. A service relates to an interface 
between two layers, with the lower layer being the service provider and the upper layer being the service user. 


A protocol, in contrast, is a set of rules governing the format and meaning of the packets, or messages, that are
exchanged by the peer entities within a layer. Entities use protocols to implement their service definitions. They 
are free to change their protocols at will, provided they do not change the service visible to their users. In this 
way, the service and the protocol are completely decoupled. 


In other words, services relate to the interfaces between layers, as illustrated in Fig. 1-19. In contrast, protocols 
relate to the packets sent between peer entities on different machines. It is important not to confuse the two 
concepts. 


Figure 1-19. The relationship between a service and a protocol.


An analogy with programming languages is worth making. A service is like an abstract data type or an object in
an object-oriented language. It defines operations that can be performed on an object but does not specify how 
these operations are implemented. A protocol relates to the implementation of the service and as such is not 
visible to the user of the service. 
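
To make the analogy concrete, the sketch below models a service as an abstract class and two protocols as interchangeable implementations of it. Everything here is hypothetical: the class names, the loopback "wire" held in a list, and both packet formats were invented for this illustration.

```python
from abc import ABC, abstractmethod


class MessageService(ABC):
    """The service: the operations a layer offers to the layer above."""

    @abstractmethod
    def send(self, data: bytes) -> None: ...

    @abstractmethod
    def receive(self) -> bytes: ...


class LengthPrefixProtocol(MessageService):
    """One protocol: each packet is a 2-byte length followed by the data."""

    def __init__(self):
        self.wire = []  # packets in transit (loopback, for illustration)

    def send(self, data: bytes) -> None:
        self.wire.append(len(data).to_bytes(2, "big") + data)

    def receive(self) -> bytes:
        packet = self.wire.pop(0)
        length = int.from_bytes(packet[:2], "big")
        return packet[2:2 + length]


class HexTextProtocol(MessageService):
    """A different protocol: each packet is the data as hex text plus newline."""

    def __init__(self):
        self.wire = []

    def send(self, data: bytes) -> None:
        self.wire.append(data.hex().encode("ascii") + b"\n")

    def receive(self) -> bytes:
        return bytes.fromhex(self.wire.pop(0).decode("ascii").strip())


def user(service: MessageService) -> bytes:
    # The service user is written against the service only.
    service.send(b"I like rabbits")
    return service.receive()
```

The function user() behaves identically with either implementation, even though the packets on the wire look completely different. Swapping the protocol does not change the service.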


Many older protocols did not distinguish the service from the protocol. In effect, a typical layer might have had a 
service primitive SEND PACKET with the user providing a pointer to a fully assembled packet. This arrangement 
meant that all changes to the protocol were immediately visible to the users. Most network designers now regard 
such a design as a serious blunder. 


1.4 Reference Models 


Now that we have discussed layered networks in the abstract, it is time to look at some examples. In the next 
two sections we will discuss two important network architectures, the OSI reference model and the TCP/IP 
reference model. Although the protocols associated with the OSI model are rarely used any more, the model 
itself is actually quite general and still valid, and the features discussed at each layer are still very important. The 
TCP/IP model has the opposite properties: the model itself is not of much use but the protocols are widely used. 
For this reason we will look at both of them in detail. Also, sometimes you can learn more from failures than from 
successes. 


1.4.1 The OSI Reference Model 


The OSI model (minus the physical medium) is shown in Fig. 1-20. This model is based on a proposal developed 
by the International Standards Organization (ISO) as a first step toward international standardization of the 
protocols used in the various layers (Day and Zimmermann, 1983). It was revised in 1995 (Day, 1995). The 
model is called the ISO OSI (Open Systems Interconnection) Reference Model because it deals with connecting 
open systems—that is, systems that are open for communication with other systems. We will just call it the OSI 
model for short.


Figure 1-20. The OSI reference model.


The OSI model has seven layers. The principles that were applied to arrive at the seven layers can be briefly
summarized as follows: 


1. A layer should be created where a different abstraction is needed.
2. Each layer should perform a well-defined function.
3. The function of each layer should be chosen with an eye toward defining internationally standardized
   protocols.
4. The layer boundaries should be chosen to minimize the information flow across the interfaces.
5. The number of layers should be large enough that distinct functions need not be thrown together in the
   same layer out of necessity and small enough that the architecture does not become unwieldy.


Below we will discuss each layer of the model in turn, starting at the bottom layer. Note that the OSI model itself 
is not a network architecture because it does not specify the exact services and protocols to be used in each 
layer. It just tells what each layer should do. However, ISO has also produced standards for all the layers, 
although these are not part of the reference model itself. Each one has been published as a separate 
international standard. 


The Physical Layer 


The physical layer is concerned with transmitting raw bits over a communication channel. The design issues 
have to do with making sure that when one side sends a 1 bit, it is received by the other side as a 1 bit, not as a 
0 bit. Typical questions here are how many volts should be used to represent a 1 and how many for a 0, how
many nanoseconds a bit lasts, whether transmission may proceed simultaneously in both directions, how the
initial connection is established and how it is torn down when both sides are finished, and how many pins the
network connector has and what each pin is used for. The design issues here largely deal with mechanical,
electrical, and timing interfaces, and the physical transmission medium, which lies below the physical layer.


The Data Link Layer 


The main task of the data link layer is to transform a raw transmission facility into a line that appears free of 
undetected transmission errors to the network layer. It accomplishes this task by having the sender break up the 
input data into data frames (typically a few hundred or a few thousand bytes) and transmit the frames 
sequentially. If the service is reliable, the receiver confirms correct receipt of each frame by sending back an 
acknowledgement frame. 


Another issue that arises in the data link layer (and most of the higher layers as well) is how to keep a fast 
transmitter from drowning a slow receiver in data. Some traffic regulation mechanism is often needed to let the 
transmitter know how much buffer space the receiver has at the moment. Frequently, this flow regulation and the 
error handling are integrated. 


Broadcast networks have an additional issue in the data link layer: how to control access to the shared channel. 
A special sublayer of the data link layer, the medium access control sublayer, deals with this problem. 


The Network Layer 


The network layer controls the operation of the subnet. A key design issue is determining how packets are 
routed from source to destination. Routes can be based on static tables that are "wired into" the network and 
rarely changed. They can also be determined at the start of each conversation, for example, a terminal session 
(e.g., a login to a remote machine). Finally, they can be highly dynamic, being determined anew for each packet, 
to reflect the current network load. 


If too many packets are present in the subnet at the same time, they will get in one another's way, forming 
bottlenecks. The control of such congestion also belongs to the network layer. More generally, the quality of 
service provided (delay, transit time, jitter, etc.) is also a network layer issue. 


When a packet has to travel from one network to another to get to its destination, many problems can arise. The 
addressing used by the second network may be different from the first one. The second one may not accept the 
packet at all because it is too large. The protocols may differ, and so on. It is up to the network layer to overcome 
all these problems to allow heterogeneous networks to be interconnected. 


In broadcast networks, the routing problem is simple, so the network layer is often thin or even nonexistent. 


The Transport Layer 


The basic function of the transport layer is to accept data from above, split it up into smaller units if need be, 
pass these to the network layer, and ensure that the pieces all arrive correctly at the other end. Furthermore, all 
this must be done efficiently and in a way that isolates the upper layers from the inevitable changes in the 
hardware technology. 


The transport layer also determines what type of service to provide to the session layer, and, ultimately, to the 
users of the network. The most popular type of transport connection is an error-free point-to-point channel that 
delivers messages or bytes in the order in which they were sent. However, other possible kinds of transport
service are the transporting of isolated messages, with no guarantee about the order of delivery, and the
broadcasting of messages to multiple destinations. The type of service is determined when the connection is 
established. (As an aside, an error-free channel is impossible to achieve; what people really mean by this term is 
that the error rate is low enough to ignore in practice.) 


The transport layer is a true end-to-end layer, all the way from the source to the destination. In other words, a
program on the source machine carries on a conversation with a similar program on the destination machine,
using the message headers and control messages. In the lower layers, the protocols are between each machine
and its immediate neighbors, and not between the ultimate source and destination machines, which may be
separated by many routers. The difference between layers 1 through 3, which are chained, and layers 4 through
7, which are end-to-end, is illustrated in Fig. 1-20.


The Session Layer 


The session layer allows users on different machines to establish sessions between them. Sessions offer 
various services, including dialog control (keeping track of whose turn it is to transmit), token management 
(preventing two parties from attempting the same critical operation at the same time), and synchronization 
(checkpointing long transmissions to allow them to continue from where they were after a crash). 


The Presentation Layer 


Unlike lower layers, which are mostly concerned with moving bits around, the presentation layer is concerned 
with the syntax and semantics of the information transmitted. In order to make it possible for computers with 
different data representations to communicate, the data structures to be exchanged can be defined in an 
abstract way, along with a standard encoding to be used "on the wire." The presentation layer manages these 
abstract data structures and allows higher-level data structures (e.g., banking records), to be defined and 
exchanged. 
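
As a small illustration of a standard encoding "on the wire," the sketch below uses Python's struct module to give a record a fixed layout in network (big-endian) byte order, independent of how either machine stores integers internally. The record fields (account number, balance in cents, numeric currency code) are assumptions made up for this example and are not part of any OSI standard.

```python
import struct

# Assumed wire format: unsigned 32-bit account number, signed 64-bit balance
# in cents, unsigned 16-bit currency code. "!" selects network byte order
# with no padding between fields.
RECORD = struct.Struct("!IqH")


def encode(account: int, cents: int, currency: int) -> bytes:
    # Convert the local representation to the agreed wire encoding.
    return RECORD.pack(account, cents, currency)


def decode(wire: bytes) -> tuple:
    # Convert the wire encoding back to local values.
    return RECORD.unpack(wire)
```

Both ends need to agree only on the wire format; each is free to represent the record internally however it likes.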


The Application Layer 


The application layer contains a variety of protocols that are commonly needed by users. One widely used
application protocol is HTTP (HyperText Transfer Protocol), which is the basis for the World Wide Web. When a 
browser wants a Web page, it sends the name of the page it wants to the server using HTTP. The server then 
sends the page back. Other application protocols are used for file transfer, electronic mail, and network news. 


1.4.2 The TCP/IP Reference Model 


Let us now turn from the OSI reference model to the reference model used in the grandparent of all wide area 
computer networks, the ARPANET, and its successor, the worldwide Internet. Although we will give a brief 
history of the ARPANET later, it is useful to mention a few key aspects of it now. The ARPANET was a research 
network sponsored by the DoD (U.S. Department of Defense). It eventually connected hundreds of universities 
and government installations, using leased telephone lines. When satellite and radio networks were added later, 
the existing protocols had trouble interworking with them, so a new reference architecture was needed. Thus, 
the ability to connect multiple networks in a seamless way was one of the major design goals from the very 
beginning. This architecture later became known as the TCP/IP Reference Model, after its two primary protocols. 
It was first defined in (Cerf and Kahn, 1974). A later perspective is given in (Leiner et al., 1985). The design 
philosophy behind the model is discussed in (Clark, 1988). 


Given the DoD's worry that some of its precious hosts, routers, and internetwork gateways might get blown to 
pieces at a moment's notice, another major goal was that the network be able to survive loss of subnet 
hardware, with existing conversations not being broken off. In other words, DoD wanted connections to remain 
intact as long as the source and destination machines were functioning, even if some of the machines or 
transmission lines in between were suddenly put out of operation. Furthermore, a flexible architecture was 
needed since applications with divergent requirements were envisioned, ranging from transferring files to 
real-time speech transmission. 


The Internet Layer 


All these requirements led to the choice of a packet-switching network based on a connectionless internetwork 
layer. This layer, called the internet layer, is the linchpin that holds the whole architecture together. Its job is to 
permit hosts to inject packets into any network and have them travel independently to the destination (potentially 
on a different network). They may even arrive in a different order than they were sent, in which case it is the job 
of higher layers to rearrange them, if in-order delivery is desired. Note that "internet" is used here in a generic 
sense, even though this layer is present in the Internet. 


The analogy here is with the (snail) mail system. A person can drop a sequence of international letters into a 
mail box in one country, and with a little luck, most of them will be delivered to the correct address in the 
destination country. Probably the letters will travel through one or more international mail gateways along the 
way, but this is transparent to the users. Furthermore, the fact that each country (i.e., each network) has its own stamps, 
preferred envelope sizes, and delivery rules is hidden from the users. 


The internet layer defines an official packet format and protocol called IP (Internet Protocol). The job of the 
internet layer is to deliver IP packets where they are supposed to go. Packet routing is clearly the major issue 
here, as is avoiding congestion. For these reasons, it is reasonable to say that the TCP/IP internet layer is 
similar in functionality to the OSI network layer. Figure 1-21 shows this correspondence. 


Figure 1-21. The TCP/IP reference model. 


The Transport Layer 


The layer above the internet layer in the TCP/IP model is now usually called the transport layer. It is designed to 
allow peer entities on the source and destination hosts to carry on a conversation, just as in the OSI transport 
layer. Two end-to-end transport protocols have been defined here. The first one, TCP (Transmission Control 
Protocol), is a reliable connection-oriented protocol that allows a byte stream originating on one machine to be 
delivered without error on any other machine in the internet. It fragments the incoming byte stream into discrete 
messages and passes each one on to the internet layer. At the destination, the receiving TCP process 
reassembles the received messages into the output stream. TCP also handles flow control to make sure a fast 
sender cannot swamp a slow receiver with more messages than it can handle. 
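

The fragment-and-reassemble idea can be sketched as follows. This is a toy illustration only: labeling each message with its byte offset is an assumption made here for simplicity, and the sketch omits acknowledgements, retransmission, and flow control, all of which real TCP needs. 

```python
import random


def segment(stream, max_payload):
    """Split a byte stream into (offset, payload) messages."""
    return [(off, stream[off:off + max_payload])
            for off in range(0, len(stream), max_payload)]


def reassemble(messages):
    """Rebuild the byte stream by sorting the messages on their offsets."""
    return b"".join(payload for _, payload in sorted(messages))


data = b"The quick brown fox jumps over the lazy dog"
msgs = segment(data, max_payload=8)
random.shuffle(msgs)   # the internet layer may deliver packets out of order
print(reassemble(msgs) == data)
```

Since every message carries its position in the stream, the receiver can restore the original order no matter how the internet layer delivered the pieces. 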


The second protocol in this layer, UDP (User Datagram Protocol), is an unreliable, connectionless protocol for 
applications that do not want TCP's sequencing or flow control and wish to provide their own. It is also widely 
used for one-shot, client-server-type request-reply queries and applications in which prompt delivery is more 
important than accurate delivery, such as transmitting speech or video. The relation of IP, TCP, and UDP 
is shown in Fig. 1-22. Since the model was developed, IP has been implemented on many other networks. 
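

A one-shot request-reply exchange over UDP might look like the sketch below, which runs both ends in one process over the loopback address. The message contents are hypothetical, and a real client would also need to handle lost datagrams (here a timeout simply raises an exception). 

```python
import socket

# Server socket: bind to loopback; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(2.0)
server_addr = server.getsockname()

# Client socket: no connection setup, just send a datagram.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(2.0)
client.sendto(b"time?", server_addr)

# The server handles one request and replies to whoever sent it.
request, client_addr = server.recvfrom(1024)
server.sendto(b"reply to " + request, client_addr)

reply, _ = client.recvfrom(1024)
print(reply)

client.close()
server.close()
```

Note that neither side establishes or tears down a connection; each datagram stands on its own. 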


Figure 1-22. Protocols and networks in the TCP/IP model initially. 


The Application Layer 


The TCP/IP model does not have session or presentation layers. No need for them was perceived, so they were 
not included. Experience with the OSI model has proven this view correct: they are of little use to most 
applications. 


On top of the transport layer is the application layer. It contains all the higher-level protocols. The early ones 
included virtual terminal (TELNET), file transfer (FTP), and electronic mail (SMTP), as shown in Fig. 1-22. The 
virtual terminal protocol allows a user on one machine to log onto a distant machine and work there. The file 
transfer protocol provides a way to move data efficiently from one machine to another. Electronic mail was 
originally just a kind of file transfer, but later a specialized protocol (SMTP) was developed for it. Many other 
protocols have been added to these over the years: the Domain Name System (DNS) for mapping host names 
onto their network addresses, NNTP, the protocol for moving USENET news articles around, and HTTP, the 
protocol for fetching pages on the World Wide Web, and many others. 


The Host-to-Network Layer 


Below the internet layer is a great void. The TCP/IP reference model does not really say much about what 
happens here, except to point out that the host has to connect to the network using some protocol so it can send 
IP packets to it. This protocol is not defined and varies from host to host and network to network. Books and 
papers about the TCP/IP model rarely discuss it. 


1.4.3 A Comparison of the OSI and TCP/IP Reference Models 


The OSI and TCP/IP reference models have much in common. Both are based on the concept of a stack of 
independent protocols. Also, the functionality of the layers is roughly similar. For example, in both models the 
layers up through and including the transport layer are there to provide an end-to-end, network-independent 
transport service to processes wishing to communicate. These layers form the transport provider. Again in both 
models, the layers above transport are application-oriented users of the transport service. 


Despite these fundamental similarities, the two models also have many differences. In this section we will focus 
on the key differences between the two reference models. It is important to note that we are comparing the 
reference models here, not the corresponding protocol stacks. The protocols themselves will be discussed later. 
For an entire book comparing and contrasting TCP/IP and OSI, see (Piscitello and Chapin, 1993). 


Three concepts are central to the OSI model: 


1. Services. 
2. Interfaces. 
3. Protocols. 


Probably the biggest contribution of the OSI model is to make the distinction between these three concepts 
explicit. Each layer performs some services for the layer above it. The service definition tells what the layer does, 
not how entities above it access it or how the layer works. It defines the layer's semantics. 


A layer's interface tells the processes above it how to access it. It specifies what the parameters are and what 
results to expect. It, too, says nothing about how the layer works inside. 


Finally, the peer protocols used in a layer are the layer's own business. It can use any protocols it wants to, as 
long as it gets the job done (i.e., provides the offered services). It can also change them at will without affecting 
software in higher layers. 


These ideas fit very nicely with modern ideas about object-oriented programming. An object, like a layer, has a 
set of methods (operations) that processes outside the object can invoke. The semantics of these methods 
define the set of services that the object offers. The methods' parameters and results form the object's interface. 
The code internal to the object is its protocol and is not visible or of any concern outside the object. 
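

The analogy can be made concrete with a small sketch. All class and function names here are hypothetical. The abstract method signature plays the role of the interface, its documented behavior is the service, and each subclass hides its own "protocol." 

```python
from abc import ABC, abstractmethod


class Layer(ABC):
    """Interface: the method and parameters that callers above may use."""

    @abstractmethod
    def send(self, data):
        """Service: return `data` as it is delivered to the peer."""


class PlainLayer(Layer):
    def send(self, data):
        return data                              # protocol: pass data through


class ChecksumLayer(Layer):
    def send(self, data):
        frame = data + bytes([sum(data) % 256])  # protocol: append a checksum
        payload, check = frame[:-1], frame[-1]
        if sum(payload) % 256 != check:          # peer side verifies it
            raise ValueError("corrupted frame")
        return payload


def user_of_layer(layer):
    # The caller depends only on the interface, not on the protocol inside.
    return layer.send(b"hello")


print(user_of_layer(PlainLayer()) == user_of_layer(ChecksumLayer()))
```

Swapping one subclass for the other changes the protocol without touching the code that uses the layer. 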


The TCP/IP model did not originally clearly distinguish between service, interface, and protocol, although people 
have tried to retrofit it after the fact to make it more OSI-like. For example, the only real services offered by the 
internet layer are SEND IP PACKET and RECEIVE IP PACKET. 


As a consequence, the protocols in the OSI model are better hidden than in the TCP/IP model and can be 
replaced relatively easily as the technology changes. Being able to make such changes is one of the main 
purposes of having layered protocols in the first place. 


The OSI reference model was devised before the corresponding protocols were invented. This ordering means 
that the model was not biased toward one particular set of protocols, a fact that made it quite general. The 
downside of this ordering is that the designers did not have much experience with the subject and did not have a 
good idea of which functionality to put in which layer. 


For example, the data link layer originally dealt only with point-to-point networks. When broadcast networks 
came around, a new sublayer had to be hacked into the model. When people started to build real networks using 
the OSI model and existing protocols, it was discovered that these networks did not match the required service 
specifications (wonder of wonders), so convergence sublayers had to be grafted onto the model to provide a 
place for papering over the differences. Finally, the committee originally expected that each country would have 
one network, run by the government and using the OSI protocols, so no thought was given to internetworking. 
To make a long story short, things did not turn out that way. 


With TCP/IP the reverse was true: the protocols came first, and the model was really just a description of the 
existing protocols. There was no problem with the protocols fitting the model. They fit perfectly. The only trouble 
was that the model did not fit any other protocol stacks. Consequently, it was not especially useful for describing 
other, non-TCP/IP networks. 


Turning from philosophical matters to more specific ones, an obvious difference between the two models is the 
number of layers: the OSI model has seven layers and the TCP/IP model has four layers. Both have (inter)network, 
transport, and application layers, but the other layers are different. 


Another difference is in the area of connectionless versus connection-oriented communication. The OSI model 
supports both connectionless and connection-oriented communication in the network layer, but only 
connection-oriented communication in the transport layer, where it counts (because the transport service is visible to the 
users). The TCP/IP model has only one mode in the network layer (connectionless) but supports both modes in 
the transport layer, giving the users a choice. This choice is especially important for simple request-response 
protocols. 


1.4.4 A Critique of the OSI Model and Protocols 


Neither the OSI model and its protocols nor the TCP/IP model and its protocols are perfect. Quite a bit of 
criticism can be, and has been, directed at both of them. In this section and the next one, we will look at some of 
these criticisms. We will begin with OSI and examine TCP/IP afterward. 


At the time the second edition of this book was published (1989), it appeared to many experts in the field that the 
OSI model and its protocols were going to take over the world and push everything else out of their way. This did 
not happen. Why? A look back at some of the lessons may be useful. These lessons can be summarized as: 


1. Bad timing. 
2. Bad technology. 
3. Bad implementations. 
4. Bad politics. 


Bad Timing 


First let us look at reason one: bad timing. The time at which a standard is established is absolutely critical to its 
success. David Clark of M.I.T. has a theory of standards that he calls the apocalypse of the two elephants, which 
is illustrated in Fig. 1-23. 


Figure 1-23. The apocalypse of the two elephants. 
[Figure: activity plotted against time, with curve labels "Research," "Standards," and "Billion dollar investment."] 


This figure shows the amount of activity surrounding a new subject. When the subject is first discovered, there is 
a burst of research activity in the form of discussions, papers, and meetings. After a while this activity subsides, 
corporations discover the subject, and the billion-dollar wave of investment hits. 


It is essential that the standards be written in the trough in between the two "elephants." If the standards are 
written too early, before the research is finished, the subject may still be poorly understood; the result is bad 
standards. If they are written too late, so many companies may have already made major investments in 
different ways of doing things that the standards are effectively ignored. If the interval between the two elephants 
is very short (because everyone is in a hurry to get started), the people developing the standards may get 
crushed. 


It now appears that the standard OSI protocols got crushed. The competing TCP/IP protocols were already in 
widespread use by research universities by the time the OSI protocols appeared. While the billion-dollar wave of 
investment had not yet hit, the academic market was large enough that many vendors had begun cautiously 
offering TCP/IP products. When OSI came around, they did not want to support a second protocol stack until 
they were forced to, so there were no initial offerings. With every company waiting for every other company to go 
first, no company went first and OSI never happened. 


Bad Technology 


The second reason that OSI never caught on is that both the model and the protocols are flawed. The choice of 
seven layers was more political than technical, and two of the layers (session and presentation) are nearly 
empty, whereas two other ones (data link and network) are overfull. 


The OSI model, along with the associated service definitions and protocols, is extraordinarily complex. When 
piled up, the printed standards occupy a significant fraction of a meter of paper. They are also difficult to 
implement and inefficient in operation. In this context, a riddle posed by Paul Mockapetris and cited in (Rose, 
1993) comes to mind: 


Q1: What do you get when you cross a mobster with an international standard? 
A1: Someone who makes you an offer you can't understand. 


In addition to being incomprehensible, another problem with OSI is that some functions, such as addressing, 
flow control, and error control, reappear again and again in each layer. Saltzer et al. (1984), for example, have 
pointed out that to be effective, error control must be done in the highest layer, so that repeating it over and over 
in each of the lower layers is often unnecessary and inefficient. 


Bad Implementations 


Given the enormous complexity of the model and the protocols, it will come as no surprise that the initial 
implementations were huge, unwieldy, and slow. Everyone who tried them got burned. It did not take long for 
people to associate "OSI" with "poor quality." Although the products improved in the course of time, the image 
stuck. 


In contrast, one of the first implementations of TCP/IP was part of Berkeley UNIX and was quite good (not to 
mention, free). People began using it quickly, which led to a large user community, which led to improvements, 
which led to an even larger community. Here the spiral was upward instead of downward. 


Bad Politics 


On account of the initial implementation, many people, especially in academia, thought of TCP/IP as part of 
UNIX, and UNIX in the 1980s in academia was not unlike parenthood (then incorrectly called motherhood) and 
apple pie. 


OSI, on the other hand, was widely thought to be the creature of the European telecommunication ministries, the 
European Community, and later the U.S. Government. This belief was only partly true, but the very idea of a 
bunch of government bureaucrats trying to shove a technically inferior standard down the throats of the poor 
researchers and programmers down in the trenches actually developing computer networks did not help much. 
Some people viewed this development in the same light as IBM announcing in the 1960s that PL/I was the 
language of the future, or DoD correcting this later by announcing that it was actually Ada. 


1.4.5 A Critique of the TCP/IP Reference Model 


The TCP/IP model and protocols have their problems too. First, the model does not clearly distinguish the 
concepts of service, interface, and protocol. Good software engineering practice requires differentiating between 
the specification and the implementation, something that OSI does very carefully, and TCP/IP does not. 
Consequently, the TCP/IP model is not much of a guide for designing new networks using new technologies. 


Second, the TCP/IP model is not at all general and is poorly suited to describing any protocol stack other than 
TCP/IP. Trying to use the TCP/IP model to describe Bluetooth, for example, is completely impossible. 


Third, the host-to-network layer is not really a layer at all in the normal sense of the term as used in the context 
of layered protocols. It is an interface (between the network and data link layers). The distinction between an 
interface and a layer is crucial, and one should not be sloppy about it. 


Fourth, the TCP/IP model does not distinguish (or even mention) the physical and data link layers. These are 
completely different. The physical layer has to do with the transmission characteristics of copper wire, fiber 
optics, and wireless communication. The data link layer's job is to delimit the start and end of frames and get 
them from one side to the other with the desired degree of reliability. A proper model should include both as 
separate layers. The TCP/IP model does not do this. 


Finally, although the IP and TCP protocols were carefully thought out and well implemented, many of the other 
protocols were ad hoc, generally produced by a couple of graduate students hacking away until they got tired. 
The protocol implementations were then distributed free, which resulted in their becoming widely used, deeply 
entrenched, and thus hard to replace. Some of them are a bit of an embarrassment now. The virtual terminal 
protocol, TELNET, for example, was designed for a ten-character-per-second mechanical Teletype terminal. It 
knows nothing of graphical user interfaces and mice. Nevertheless, 25 years later, it is still in widespread use. 


In summary, despite its problems, the OSI model (minus the session and presentation layers) has proven to be 
exceptionally useful for discussing computer networks. In contrast, the OSI protocols have not become popular. 
The reverse is true of TCP/IP: the model is practically nonexistent, but the protocols are widely used. Since 
computer scientists like to have their cake and eat it, too, in this book we will use a modified OSI model but 
concentrate primarily on the TCP/IP and related protocols, as well as newer ones such as 802, SONET, and 
Bluetooth. In effect, we will use the hybrid model of Fig. 1-24 as the framework for this book. 


Figure 1-24. The hybrid reference model to be used in this book. 


1.5 Example Networks 


The subject of computer networking covers many different kinds of networks, large and small, well known and 
less well known. They have different goals, scales, and technologies. In the following sections, we will look at 
some examples, to get an idea of the variety one finds in the area of computer networking. 


We will start with the Internet, probably the best known network, and look at its history, evolution, and 
technology. Then we will consider ATM, which is often used within the core of large (telephone) networks. 
Technically, it is quite different from the Internet, contrasting nicely with it. Next we will introduce Ethernet, the 
dominant local area network. Finally, we will look at IEEE 802.11, the standard for wireless LANs. 


1.5.1 The Internet 


The Internet is not a network at all, but a vast collection of different networks that use certain common protocols 
and provide certain common services. It is an unusual system in that it was not planned by anyone and is not 
controlled by anyone. To better understand it, let us start from the beginning and see how it has developed and 
why. For a wonderful history of the Internet, John Naughton's (2000) book is highly recommended. It is one of 
those rare books that is not only fun to read, but also has 20 pages of ibid.'s and op. cit.'s for the serious 
historian. Some of the material below is based on this book. 


Of course, countless technical books have been written about the Internet and its protocols as well. For more 
information, see, for example, (Maufer, 1999). 


The ARPANET 


The story begins in the late 1950s. At the height of the Cold War, the DoD wanted a command-and-control 
network that could survive a nuclear war. At that time, all military communications used the public telephone 
network, which was considered vulnerable. The reason for this belief can be gleaned from Fig. 1-25(a). Here the 
black dots represent telephone switching offices, each of which was connected to thousands of telephones. 
These switching offices were, in turn, connected to higher-level switching offices (toll offices), to form a national 
hierarchy with only a small amount of redundancy. The vulnerability of the system was that the destruction of a 
few key toll offices could fragment the system into many isolated islands. 


Figure 1-25. (a) Structure of the telephone system. (b) Baran's proposed distributed switching system. 


Around 1960, the DoD awarded a contract to the RAND Corporation to find a solution. One of its employees, 
Paul Baran, came up with the highly distributed and fault-tolerant design of Fig. 1-25(b). Since the paths 
between any two switching offices were now much longer than analog signals could travel without distortion, 
Baran proposed using digital packet-switching technology throughout the system. Baran wrote several reports 
for the DoD describing his ideas in detail. Officials at the Pentagon liked the concept and asked AT&T, then the 
U.S. national telephone monopoly, to build a prototype. AT&T dismissed Baran's ideas out of hand. The biggest 
and richest corporation in the world was not about to allow some young whippersnapper to tell it how to build a 
telephone system. They said Baran's network could not be built and the idea was killed. 


Several years went by and still the DoD did not have a better command-and-control system. To understand what 
happened next, we have to go back to October 1957, when the Soviet Union beat the U.S. into space with the 
launch of the first artificial satellite, Sputnik. When President Eisenhower tried to find out who was asleep at the 
switch, he was appalled to find the Army, Navy, and Air Force squabbling over the Pentagon's research budget. 
His immediate response was to create a single defense research organization, ARPA, the Advanced Research 
Projects Agency. ARPA had no scientists or laboratories; in fact, it had nothing more than an office and a small 
(by Pentagon standards) budget. It did its work by issuing grants and contracts to universities and companies 
whose ideas looked promising to it. 


For the first few years, ARPA tried to figure out what its mission should be, but in 1967, the attention of Larry 
Roberts, a program manager at ARPA, turned to networking. He contacted various experts to decide what to do. One of 
them, Wesley Clark, suggested building a packet-switched subnet, giving each host its own router, as illustrated 
in Fig. 1-10. 


After some initial skepticism, Roberts bought the idea and presented a somewhat vague paper about it at the 
ACM SIGOPS Symposium on Operating System Principles held in Gatlinburg, Tennessee in late 1967 (Roberts, 
1967). Much to Roberts' surprise, another paper at the conference described a similar system that had not only 
been designed but actually implemented under the direction of Donald Davies at the National Physical 
Laboratory in England. The NPL system was not a national system (it just connected several computers on the 
NPL campus), but it demonstrated that packet switching could be made to work. Furthermore, it cited Baran's 
now discarded earlier work. Roberts came away from Gatlinburg determined to build what later became known 
as the ARPANET. 


The subnet would consist of minicomputers called IMPs (Interface Message Processors) connected by 56-kbps 
transmission lines. For high reliability, each IMP would be connected to at least two other IMPs. The subnet was 
to be a datagram subnet, so if some lines and IMPs were destroyed, messages could be automatically rerouted 
along alternative paths. 
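

Why two or more links per IMP make rerouting possible can be sketched with a small path search. The four-node topology and the centralized breadth-first search below are assumptions for illustration only; they are not the routing algorithm the ARPANET actually used. 

```python
from collections import deque

# Hypothetical topology in which every IMP has at least two neighbors.
LINKS = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}


def find_route(links, src, dst, failed=frozenset()):
    """Breadth-first search for a shortest path that avoids failed IMPs."""
    if src in failed or dst in failed:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(links[node]):
            if nxt not in failed and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


print(find_route(LINKS, "A", "D"))                 # ['A', 'B', 'D']
print(find_route(LINKS, "A", "D", failed={"B"}))   # ['A', 'C', 'D']
```

With IMP B destroyed, traffic from A to D still gets through via C; only when both B and C fail does the search return no route. 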


Each node of the network was to consist of an IMP and a host, in the same room, connected by a short wire. A 
host could send messages of up to 8063 bits to its IMP, which would then break these up into packets of at most 
1008 bits and forward them independently toward the destination. Each packet was received in its entirety 
before being forwarded, so the subnet was the first electronic store-and-forward packet-switching network. 
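

As a quick check on these figures, the sketch below computes how many packets a message of a given size needs. It treats the two limits as plain sizes and ignores any per-packet header overhead, which is a simplifying assumption. 

```python
MAX_MESSAGE_BITS = 8063   # largest message a host could hand to its IMP
MAX_PACKET_BITS = 1008    # largest packet forwarded through the subnet


def packet_sizes(message_bits):
    """Return the size in bits of each packet needed for one message."""
    if not 0 < message_bits <= MAX_MESSAGE_BITS:
        raise ValueError("message size out of range")
    full, rest = divmod(message_bits, MAX_PACKET_BITS)
    return [MAX_PACKET_BITS] * full + ([rest] if rest else [])


sizes = packet_sizes(MAX_MESSAGE_BITS)
print(len(sizes), sizes[-1])   # 8 1007
```

A maximum-size message therefore becomes eight packets: seven full ones of 1008 bits and a final one of 1007 bits. 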


ARPA then put out a tender for building the subnet. Twelve companies bid for it. After evaluating all the 
proposals, ARPA selected BBN, a consulting firm in Cambridge, Massachusetts, and in December 1968, 
awarded it a contract to build the subnet and write the subnet software. BBN chose to use specially modified 
Honeywell DDP-316 minicomputers with 12K 16-bit words of core memory as the IMPs. The IMPs did not have 
disks, since moving parts were considered unreliable. The IMPs were interconnected by 56-kbps lines leased 
from telephone companies. Although 56 kbps is now the choice of teenagers who cannot afford ADSL or cable, 
it was then the best money could buy. 


The software was split into two parts: subnet and host. The subnet software consisted of the IMP end of the 
host-IMP connection, the IMP-IMP protocol, and a source IMP to destination IMP protocol designed to improve 
reliability. The original ARPANET design is shown in Fig. 1-26. 


Figure 1-26. The original ARPANET design. 


Outside the subnet, software was also needed, namely, the host end of the host-IMP connection, the host-host 
protocol, and the application software. It soon became clear that BBN felt that when it had accepted a message 
on a host-IMP wire and placed it on the host-IMP wire at the destination, its job was done. 


Roberts had a problem: the hosts needed software too. To deal with it, he convened a meeting of network 
researchers, mostly graduate students, at Snowbird, Utah, in the summer of 1969. The graduate students 
expected some network expert to explain the grand design of the network and its software to them and then to 
assign each of them the job of writing part of it. They were astounded when there was no network expert and no 
grand design. They had to figure out what to do on their own. 


Nevertheless, somehow an experimental network went on the air in December 1969 with four nodes: at UCLA, 
UCSB, SRI, and the University of Utah. These four were chosen because all had a large number of ARPA 
contracts, and all had different and completely incompatible host computers (just to make it more fun). The 
network grew quickly as more IMPs were delivered and installed; it soon spanned the United States. Figure 1-27 
shows how rapidly the ARPANET grew in the first 3 years. 


Figure 1-27. Growth of the ARPANET. (a) December 1969. (b) July 1970. (c) March 1971. (d) April 1972. (e) 
September 1972. 


In addition to helping the fledgling ARPANET grow, ARPA also funded research on the use of satellite networks 
and mobile packet radio networks. In one now famous demonstration, a truck driving around in California used 
the packet radio network to send messages to SRI, which were then forwarded over the ARPANET to the East 
Coast, where they were shipped to University College in London over the satellite network. This allowed a 
researcher in the truck to use a computer in London while driving around in California. 


This experiment also demonstrated that the existing ARPANET protocols were not suitable for running over 
multiple networks. This observation led to more research on protocols, culminating with the invention of the 
TCP/IP model and protocols (Cerf and Kahn, 1974). TCP/IP was specifically designed to handle communication 
over internetworks, something becoming increasingly important as more and more networks were being hooked 
up to the ARPANET. 


To encourage adoption of these new protocols, ARPA awarded several contracts to BBN and the University of 
California at Berkeley to integrate them into Berkeley UNIX. Researchers at Berkeley developed a convenient 
program interface to the network (sockets) and wrote many application, utility, and management programs to 
make networking easier. 


The timing was perfect. Many universities had just acquired a second or third VAX computer and a LAN to 
connect them, but they had no networking software. When 4.2BSD came along, with TCP/IP, sockets, and many 
network utilities, the complete package was adopted immediately. Furthermore, with TCP/IP, it was easy for the 
LANs to connect to the ARPANET, and many did. 


During the 1980s, additional networks, especially LANs, were connected to the ARPANET. As the scale 
increased, finding hosts became increasingly expensive, so DNS (Domain Name System) was created to 
organize machines into domains and map host names onto IP addresses. Since then, DNS has become a 
generalized, distributed database system for storing a variety of information related to naming. We will study it in 
detail in Chap. 7. 


NSFNET 


By the late 1970s, NSF (the U.S. National Science Foundation) saw the enormous impact the ARPANET was 
having on university research, allowing scientists across the country to share data and collaborate on research 
projects. However, to get on the ARPANET, a university had to have a research contract with the DoD, which 
many did not have. NSF's response was to design a successor to the ARPANET that would be open to all 
university research groups. To have something concrete to start with, NSF decided to build a backbone network 
to connect its six supercomputer centers, in San Diego, Boulder, Champaign, Pittsburgh, Ithaca, and Princeton. 
Each supercomputer was given a little brother, consisting of an LSI-11 microcomputer called a fuzzball. The 
fuzzballs were connected with 56-kbps leased lines and formed the subnet, the same hardware technology as 
the ARPANET used. The software technology was different, however: the fuzzballs spoke TCP/IP right from the 
start, making it the first TCP/IP WAN. 


NSF also funded some (eventually about 20) regional networks that connected to the backbone to allow users at 
thousands of universities, research labs, libraries, and museums to access any of the supercomputers and to 
communicate with one another. The complete network, including the backbone and the regional networks, was 
called NSFNET. It connected to the ARPANET through a link between an IMP and a fuzzball in the 
Carnegie-Mellon machine room. The first NSFNET backbone is illustrated in Fig. 1-28. 


Figure 1-28. The NSFNET backbone in 1988. 


NSFNET was an instantaneous success and was overloaded from the word go. NSF immediately began 
planning its successor and awarded a contract to the Michigan-based MERIT consortium to run it. Fiber optic 
channels at 448 kbps were leased from MCI (since merged with WorldCom) to provide the version 2 backbone. 
IBM PC-RTs were used as routers. This, too, was soon overwhelmed, and by 1990, the second backbone was 
upgraded to 1.5 Mbps. 


As growth continued, NSF realized that the government could not continue financing networking forever. 
Furthermore, commercial organizations wanted to join but were forbidden by NSF's charter from using networks 
NSF paid for. Consequently, NSF encouraged MERIT, MCI, and IBM to form a nonprofit corporation, ANS 
(Advanced Networks and Services), as the first step along the road to commercialization. In 1990, ANS took 
over NSFNET and upgraded the 1.5-Mbps links to 45 Mbps to form ANSNET. This network operated for 5 years 
and was then sold to America Online. But by then, various companies were offering commercial IP service and it 
was clear the government should now get out of the networking business. 


To ease the transition and make sure every regional network could communicate with every other regional 
network, NSF awarded contracts to four different network operators to establish a NAP (Network Access Point). 
These operators were PacBell (San Francisco), Ameritech (Chicago), MFS (Washington, D.C.), and Sprint (New 
York City, where for NAP purposes, Pennsauken, New Jersey, counts as New York City). Every network operator 
that wanted to provide backbone service to the NSF regional networks had to connect to all the NAPs. 


This arrangement meant that a packet originating on any regional network had a choice of backbone carriers to 
get from its NAP to the destination's NAP. Consequently, the backbone carriers were forced to compete for the 
regional networks' business on the basis of service and price, which was the idea, of course. As a result, the 
concept of a single default backbone was replaced by a commercially driven competitive infrastructure. Many 
people like to criticize the Federal Government for not being innovative, but in the area of networking, it was DoD 
and NSF that created the infrastructure that formed the basis for the Internet and then handed it over to industry 
to operate. 


During the 1990s, many other countries and regions also built national research networks, often patterned on the 
ARPANET and NSFNET. These included EuropaNET and EBONE in Europe, which started out with 2-Mbps 
lines and then upgraded to 34-Mbps lines. Eventually, the network infrastructure in Europe was handed over to 
industry as well. 


Internet Usage 


The number of networks, machines, and users connected to the ARPANET grew rapidly after TCP/IP became 
the only official protocol on January 1, 1983. When NSFNET and the ARPANET were interconnected, the growth 
became exponential. Many regional networks joined up, and connections were made to networks in Canada, 
Europe, and the Pacific. 


Sometime in the mid-1980s, people began viewing the collection of networks as an internet, and later as the 
Internet, although there was no official dedication with some politician breaking a bottle of champagne over a 
fuzzball. 


The glue that holds the Internet together is the TCP/IP reference model and TCP/IP protocol stack. TCP/IP 
makes universal service possible and can be compared to the adoption of standard gauge by the railroads in the 
19th century or the adoption of common signaling protocols by all the telephone companies. 


What does it actually mean to be on the Internet? Our definition is that a machine is on the Internet if it runs the 
TCP/IP protocol stack, has an IP address, and can send IP packets to all the other machines on the Internet. 
The mere ability to send and receive electronic mail is not enough, since e-mail is gatewayed to many networks 
outside the Internet. However, the issue is clouded somewhat by the fact that millions of personal computers can 
call up an Internet service provider using a modem, be assigned a temporary IP address, and send IP packets to 
other Internet hosts. It makes sense to regard such machines as being on the Internet for as long as they are 
connected to the service provider's router. 


Traditionally (meaning 1970 to about 1990), the Internet and its predecessors had four main applications: 


1. E-mail. The ability to compose, send, and receive electronic mail has been around since the early days 
of the ARPANET and is enormously popular. Many people get dozens of messages a day and consider 
it their primary way of interacting with the outside world, far outdistancing the telephone and snail mail. 
E-mail programs are available on virtually every kind of computer these days. 

2. News. Newsgroups are specialized forums in which users with a common interest can exchange 
messages. Thousands of newsgroups exist, devoted to technical and nontechnical topics, including 
computers, science, recreation, and politics. Each newsgroup has its own etiquette, style, and customs, 
and woe betide anyone violating them. 

3. Remote login. Using the telnet, rlogin, or ssh programs, users anywhere on the Internet can log on to 
any other machine on which they have an account. 

4. File transfer. Using the FTP program, users can copy files from one machine on the Internet to another. 
Vast numbers of articles, databases, and other information are available this way. 


Up until the early 1990s, the Internet was largely populated by academic, government, and industrial 
researchers. One new application, the WWW (World Wide Web), changed all that and brought millions of new, 
nonacademic users to the net. This application, invented by CERN physicist Tim Berners-Lee, did not change 
any of the underlying facilities but made them easier to use. Together with the Mosaic browser, written by Marc 
Andreessen at the National Center for Supercomputing Applications in Urbana, Illinois, the WWW made it 
possible for a site to set up a number of pages of information containing text, pictures, sound, and even video, 
with embedded links to other pages. By clicking on a link, the user is suddenly transported to the page pointed to 
by that link. For example, many companies have a home page with entries pointing to other pages for product 
information, price lists, sales, technical support, communication with employees, stockholder information, and 
more. 

Numerous other kinds of pages have come into existence in a very short time, including maps, stock market 
tables, library card catalogs, recorded radio programs, and even a page pointing to the complete text of many 
books whose copyrights have expired (Mark Twain, Charles Dickens, etc.). Many people also have personal 
pages (home pages). 

Much of this growth during the 1990s was fueled by companies called ISPs (Internet Service Providers). These 
are companies that offer individual users at home the ability to call up one of their machines and connect to the 
Internet, thus gaining access to e-mail, the WWW, and other Internet services. These companies signed up tens 
of millions of new users a year during the late 1990s, completely changing the character of the network from an 
academic and military playground to a public utility, much like the telephone system. The number of Internet 
users now is unknown, but is certainly hundreds of millions worldwide and will probably hit 1 billion fairly soon. 


Architecture of the Internet 


In this section we will attempt to give a brief overview of the Internet today. Due to the many mergers between 
telephone companies (telcos) and ISPs, the waters have become muddied and it is often hard to tell who is 
doing what. Consequently, this description will be of necessity somewhat simpler than reality. The big picture is 
shown in Fig. 1-29. Let us examine this figure piece by piece now. 


Figure 1-29. Overview of the Internet. 


A good place to start is with a client at home. Let us assume our client calls his or her ISP over a dial-up 
telephone line, as shown in Fig. 1-29. The modem is a card within the PC that converts the digital signals the 
computer produces to analog signals that can pass unhindered over the telephone system. These signals are 
transferred to the ISP's POP (Point of Presence), where they are removed from the telephone system and 
injected into the ISP's regional network. From this point on, the system is fully digital and packet switched. If the 
ISP is the local telco, the POP will probably be located in the telephone switching office where the telephone 
wire from the client terminates. If the ISP is not the local telco, the POP may be a few switching offices down the 
road. 


The ISP's regional network consists of interconnected routers in the various cities the ISP serves. If the packet is 
destined for a host served directly by the ISP, the packet is delivered to the host. Otherwise, it is handed over to 
the ISP's backbone operator. 


At the top of the food chain are the major backbone operators, companies like AT&T and Sprint. They operate 
large international backbone networks, with thousands of routers connected by high-bandwidth fiber optics. 
Large corporations and hosting services that run server farms (machines that can serve thousands of Web 
pages per second) often connect directly to the backbone. Backbone operators encourage this direct connection 
by renting space in what are called carrier hotels, basically equipment racks in the same room as the router to 
allow short, fast connections between server farms and the backbone. 


If a packet given to the backbone is destined for an ISP or company served by the backbone, it is sent to the 
closest router and handed off there. However, many backbones, of varying sizes, exist in the world, so a packet 
may have to go to a competing backbone. To allow packets to hop between backbones, all the major backbones 
connect at the NAPs discussed earlier. Basically, a NAP is a room full of routers, at least one per backbone. A 
LAN in the room connects all the routers, so packets can be forwarded from any backbone to any other 
backbone. In addition to being interconnected at NAPs, the larger backbones have numerous direct connections 
between their routers, a technique known as private peering. One of the many paradoxes of the Internet is that 
ISPs who publicly compete with one another for customers often privately cooperate to do private peering (Metz, 
2001). 


This ends our quick tour of the Internet. We will have a great deal to say about the individual components and 
their design, algorithms, and protocols in subsequent chapters. Also worth mentioning in passing is that some 
companies have interconnected all their existing internal networks, often using the same technology as the 
Internet. These intranets are typically accessible only within the company but otherwise work the same way as 
the Internet. 


1.5.2 Connection-Oriented Networks: X.25, Frame Relay, and ATM 


Since the beginning of networking, a war has been going on between the people who support connectionless 
(i.e., datagram) subnets and the people who support connection-oriented subnets. The main proponents of the 
connectionless subnets come from the ARPANET/Internet community. Remember that DoD's original desire in 
funding and building the ARPANET was to have a network that would continue functioning even after multiple 
direct hits by nuclear weapons wiped out numerous routers and transmission lines. Thus, fault tolerance was 
high on their priority list; billing customers was not. This approach led to a connectionless design in which every 
packet is routed independently of every other packet. As a consequence, if some routers go down during a 
session, no harm is done as long as the system can reconfigure itself dynamically so that subsequent packets 
can find some route to the destination, even if it is different from that which previous packets used. 
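

The fault tolerance argument can be made concrete with a small sketch. The code below assumes a hypothetical five-router subnet (the LINKS table and the function name find_route are inventions for illustration) and uses a plain breadth-first search to pick a route that avoids failed routers. Real routers use the routing algorithms covered in later chapters, so this only illustrates the principle that later packets can take a different route when part of the subnet goes down.

```python
from collections import deque

# Hypothetical five-router subnet; an entry means a transmission line exists.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def find_route(src, dst, failed=frozenset(), links=LINKS):
    """Breadth-first search for a shortest route that avoids failed routers."""
    if src in failed or dst in failed:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        router = queue.popleft()
        if router == dst:
            # Rebuild the route by walking the predecessor chain backward.
            route = []
            while router is not None:
                route.append(router)
                router = previous[router]
            return route[::-1]
        for neighbor in links[router]:
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = router
                queue.append(neighbor)
    return None  # destination unreachable
```

If router B goes down in the middle of a session, find_route("A", "E", failed={"B"}) still finds a route through C, so subsequent packets get through. Only when D, the sole neighbor of E, fails does the destination become unreachable.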


The connection-oriented camp comes from the world of telephone companies. In the telephone system, a caller 
must dial the called party's number and wait for a connection before talking or sending data. This connection 
setup establishes a route through the telephone system that is maintained until the call is terminated. All words 
or packets follow the same route. If a line or switch on the path goes down, the call is aborted. This property is 
precisely what the DoD did not like about it. 


Why do the telephone companies like it then? There are two reasons: 


1. Quality of service. 
2. Billing. 


By setting up a connection in advance, the subnet can reserve resources such as buffer space and router CPU 
capacity. If an attempt is made to set up a call and insufficient resources are available, the call is rejected and 
the caller gets a kind of busy signal. In this way, once a connection has been set up, the connection will get good 
service. With a connectionless network, if too many packets arrive at the same router at the same moment, the 
router will choke and probably lose packets. The sender will eventually notice this and resend them, but the 
quality of service will be jerky and unsuitable for audio or video unless the network is very lightly loaded. 
Needless to say, providing adequate audio quality is something telephone companies care about very much, 
hence their preference for connections. 
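

The reservation idea can be sketched in a few lines. The class below is a toy model, not a description of any real switch: the name Router, the use of a single pool of buffers as the only resource, and the method names are all assumptions made for illustration.

```python
class Router:
    """Toy router that hands out a fixed pool of buffers to connections."""

    def __init__(self, total_buffers):
        self.free_buffers = total_buffers
        self.reserved = {}  # connection id -> buffers held

    def setup(self, conn_id, buffers_needed):
        """Reserve buffers for a new connection, or refuse it (busy signal)."""
        if conn_id in self.reserved:
            raise ValueError(f"connection {conn_id} already exists")
        if buffers_needed > self.free_buffers:
            return False  # call rejected; existing connections are unaffected
        self.free_buffers -= buffers_needed
        self.reserved[conn_id] = buffers_needed
        return True

    def release(self, conn_id):
        """Tear down a connection and return its buffers to the pool."""
        self.free_buffers += self.reserved.pop(conn_id)
```

A router created with 10 buffers accepts a connection needing 6, refuses a second one needing 6, and accepts it once the first has been released. The accepted connection never has to compete for its buffers, which is the source of the good service.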


The second reason the telephone companies like connection-oriented service is that they are accustomed to 
charging for connect time. When you make a long distance call (or even a local call outside North America), you 
are charged by the minute. When networks came around, they just automatically gravitated toward a model in 
which charging by the minute was easy to do. If you have to set up a connection before sending data, that is 
when the billing clock starts running. If there is no connection, they cannot charge for it. 


Ironically, maintaining billing records is very expensive. If a telephone company were to adopt a flat monthly rate 
with unlimited calling and no billing or record keeping, it would probably save a huge amount of money, despite 
the increased calling this policy would generate. Political, regulatory, and other factors weigh against doing this, 
however. Interestingly enough, flat rate service exists in other sectors. For example, cable TV is billed at a flat 
rate per month, no matter how many programs you watch. It could have been designed with pay-per-view as the 
basic concept, but it was not, due in part to the expense of billing (and given the quality of most television, the 
embarrassment factor cannot be totally discounted either). Also, many theme parks charge a daily admission fee 
for unlimited rides, in contrast to traveling carnivals, which charge by the ride. 


That said, it should come as no surprise that all networks designed by the telephone industry have had 
connection-oriented subnets. What is perhaps surprising is that the Internet is also drifting in that direction, in 
order to provide a better quality of service for audio and video, a subject we will return to in Chap. 5. But now let 
us examine some connection-oriented networks. 


X.25 and Frame Relay 


Our first example of a connection-oriented network is X.25, which was the first public data network. It was 
deployed in the 1970s at a time when telephone service was a monopoly everywhere and the telephone 
company in each country expected there to be one data network per country—theirs. To use X.25, a computer 
first established a connection to the remote computer, that is, placed a telephone call. This connection was given 
a connection number to be used in data transfer packets (because multiple connections could be open at the 
same time). Data packets were very simple, consisting of a 3-byte header and up to 128 bytes of data. The 
header consisted of a 12-bit connection number, a packet sequence number, an acknowledgement number, and 
a few miscellaneous bits. X.25 networks operated for about a decade with mixed success. 
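

To see how these fields can share a 3-byte header, consider the following sketch. Only the 3-byte header size, the 12-bit connection number, and the 128-byte data limit come from the description above. The 3-bit widths for the sequence and acknowledgement numbers, the 6 miscellaneous bits, the bit ordering, and the function names are assumptions chosen so that the fields fill exactly 24 bits. The layout is illustrative and is not claimed to match the X.25 wire format.

```python
MAX_DATA = 128  # bytes of data per packet, from the text


def pack_packet(conn, seq, ack, misc, data):
    """Build a packet: 3-byte header followed by up to 128 bytes of data.

    Assumed layout (illustrative only), most significant bits first:
    12-bit connection number | 3-bit seq | 3-bit ack | 6 misc bits.
    """
    if not 0 <= conn < 1 << 12:
        raise ValueError("connection number must fit in 12 bits")
    if not (0 <= seq < 8 and 0 <= ack < 8 and 0 <= misc < 64):
        raise ValueError("seq/ack must fit in 3 bits, misc in 6 bits")
    if len(data) > MAX_DATA:
        raise ValueError("at most 128 bytes of data per packet")
    header = (conn << 12) | (seq << 9) | (ack << 6) | misc
    return header.to_bytes(3, "big") + bytes(data)


def unpack_packet(packet):
    """Split a packet back into (conn, seq, ack, misc, data)."""
    header = int.from_bytes(packet[:3], "big")
    conn = header >> 12
    seq = (header >> 9) & 0x7
    ack = (header >> 6) & 0x7
    misc = header & 0x3F
    return conn, seq, ack, misc, packet[3:]
```

A 12-bit connection number allows up to 4096 distinct values, which is why pack_packet rejects anything larger.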


In the 1980s, the X.25 networks were largely replaced by a new kind of network called frame relay. The essence 
of frame relay is that it is a connection-oriented network with no error control and no flow control. Because it is 
connection-oriented, packets are delivered in order (if they are delivered at all). The properties of in-order 
delivery, no error control, and no flow control make frame relay akin to a wide area LAN. Its most important 
application is interconnecting LANs at multiple company offices. Frame relay enjoyed a modest success and is 
still in use in places today. 


Asynchronous Transfer Mode 


Yet another, and far more important, connection-oriented network is ATM (Asynchronous Transfer Mode). The 
reason for the somewhat strange name is that in the telephone system, most transmission is synchronous 
(closely tied to a clock), and ATM is not. 


ATM was designed in the early 1990s and launched amid truly incredible hype (Ginsburg, 1996; Goralski, 1995; 
Ibe, 1997; Kim et al., 1994; and Stallings, 2000). ATM was going to solve all the world's networking and 
telecommunications problems by merging voice, data, cable television, telex, telegraph, carrier pigeon, tin cans 
connected by strings, tom-toms, smoke signals, and everything else into a single integrated system that could do 
everything for everyone. It did not happen. In large part, the problems were similar to those we described earlier 
concerning OSI, that is, bad timing, technology, implementation, and politics. Having just beaten back the 
telephone companies in round 1, many in the Internet community saw ATM as Internet versus the Telcos: the 
Sequel. But it really was not, and this time around even diehard datagram fanatics were aware that the Internet's 
quality of service left a lot to be desired. To make a long story short, ATM was much more successful than OSI, 
and it is now widely used deep within the telephone system, often for moving IP packets. Because it is now 
mostly used by carriers for internal transport, users are often unaware of its existence, but it is definitely alive 
and well. 


ATM Virtual Circuits 


Since ATM networks are connection-oriented, sending data requires first sending a packet to set up the 
connection. As the setup packet wends its way through the subnet, all the routers on the path make an entry in 
their internal tables noting the existence of the connection and reserving whatever resources are needed for it. 
Connections are often called virtual circuits, in analogy with the physical circuits used within the telephone 
system. Most ATM networks also support permanent virtual circuits, which are permanent connections between 
two (distant) hosts. They are similar to leased lines in the telephone world. Each connection, temporary or 
permanent, has a unique connection identifier. A virtual circuit is illustrated in Fig. 1-30. 
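

The table entries made during connection setup can be sketched as follows. This is a simplified model under stated assumptions: the class VCRouter and the functions setup_circuit and forward are hypothetical names, one identifier is used unchanged along the whole path, and resource reservation is left out.

```python
class VCRouter:
    """Toy router that records which virtual circuits pass through it."""

    def __init__(self, name):
        self.name = name
        self.table = {}  # connection identifier -> name of next hop


def setup_circuit(path, conn_id):
    """Simulate a setup packet travelling along `path` (a list of routers).

    Each router on the path notes the connection and its next hop; the
    last router hands cells to the receiving host.
    """
    # Check first, so a clash leaves no half-built circuit behind.
    for router in path:
        if conn_id in router.table:
            raise ValueError(f"{router.name}: identifier {conn_id} in use")
    for router, next_hop in zip(path, path[1:] + [None]):
        router.table[conn_id] = next_hop.name if next_hop else "host"


def forward(first_router, conn_id, routers_by_name):
    """Follow the table entries for `conn_id`; return the routers visited."""
    visited, name = [], first_router.name
    while name != "host":
        visited.append(name)
        name = routers_by_name[name].table[conn_id]
    return visited
```

After setup_circuit([a, b, c], 7), every cell carrying identifier 7 is forwarded by table lookup alone and follows A, B, C in that order.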




Figure 1-30. A virtual circuit. 


[Figure labels: sending host, sending process, router, subnet, virtual circuit, receiving process, receiving host.] 


Once a connection has been established, either side can begin transmitting data. The basic idea behind ATM is 
to transmit all information in small, fixed-size packets called cells. The cells are 53 bytes long, of which 5 bytes 
are header and 48 bytes are payload, as shown in Fig. 1-31. Part of the header is the connection identifier, so 
the sending and receiving hosts and all the intermediate routers can tell which cells belong to which connections. 
This information allows each router to know how to route each incoming cell. Cell routing is done in hardware, at 
high speed. In fact, the main argument for having fixed-size cells is that it is easy to build hardware routers to 
handle short, fixed-length cells. Variable-length IP packets have to be routed by software, which is a slower 
process. Another plus of ATM is that the hardware can be set up to copy one incoming cell to multiple output 
lines, a property that is required for handling a television program that is being broadcast to many receivers. 
Finally, small cells do not block any line for very long, which makes guaranteeing quality of service easier. 
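

The cost of the fixed cell format is easy to work out from the numbers above. The sketch below uses only the 53-byte cell and 5-byte header from the text. The function names are ours, and the calculation ignores any additional overhead added by higher layers.

```python
CELL_SIZE = 53      # bytes per cell, from the text
HEADER_SIZE = 5     # bytes of header
PAYLOAD_SIZE = CELL_SIZE - HEADER_SIZE  # 48 bytes of payload


def cells_needed(message_bytes):
    """Number of cells required to carry a message of the given length."""
    return -(-message_bytes // PAYLOAD_SIZE)  # ceiling division


def wire_bytes(message_bytes):
    """Total bytes sent, counting headers and padding in the last cell."""
    return cells_needed(message_bytes) * CELL_SIZE


def header_overhead():
    """Fraction of every cell that is header rather than payload."""
    return HEADER_SIZE / CELL_SIZE
```

The header takes 5/53 of every cell, a little over 9 percent. A 1000-byte message needs 21 cells, because 20 cells carry only 960 bytes of payload.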


Figure 1-31. An ATM cell. 


All cells follow the same route to the destination. Cell delivery is not guaranteed, but their order is. If cells 1 and 2 
are sent in that order, then if both arrive, they will arrive in that order, never first 2 then 1. But either or both of 
them can be lost along the way. It is up to higher protocol levels to recover from lost cells. Note that although this 
guarantee is not perfect, it is better than what the Internet provides. There, packets can not only be lost but can 
also be delivered out of order. ATM, in contrast, guarantees never to deliver cells out of order. 


ATM networks are organized like traditional WANs, with lines and switches (routers). The most common speeds 
for ATM networks are 155 Mbps and 622 Mbps, although higher speeds are also supported. The 155-Mbps 
speed was chosen because this is about what is needed to transmit high definition television. The exact choice 
of 155.52 Mbps was made for compatibility with AT&T's SONET transmission system, something we will study in 
Chap. 2. The 622 Mbps speed was chosen so that four 155-Mbps channels could be sent over it. 


The ATM Reference Model 


ATM has its own reference model, different from the OSI model and also different from the TCP/IP model. This 
model is shown in Fig. 1-32. It consists of three layers, the physical, ATM, and ATM adaptation layers, plus 
whatever users want to put on top of that. 




Figure 1-32. The ATM reference model. 


The physical layer deals with the physical medium: voltages, bit timing, and various other issues. ATM does not 
prescribe a particular set of rules but instead says that ATM cells can be sent on a wire or fiber by themselves, 
but they can also be packaged inside the payload of other carrier systems. In other words, ATM has been 
designed to be independent of the transmission medium. 


The ATM layer deals with cells and cell transport. It defines the layout of a cell and tells what the header fields 
mean. It also deals with establishment and release of virtual circuits. Congestion control is also located here. 


Because most applications do not want to work directly with cells (although some may), a layer above the ATM 
layer has been defined to allow users to send packets larger than a cell. The ATM interface segments these 
packets, transmits the cells individually, and reassembles them at the other end. This layer is the AAL (ATM 
Adaptation Layer). 


Unlike the earlier two-dimensional reference models, the ATM model is defined as being three-dimensional, as 
shown in Fig. 1-32. The user plane deals with data transport, flow control, error correction, and other user 
functions. In contrast, the control plane is concerned with connection management. The layer and plane 
management functions relate to resource management and interlayer coordination. 


The physical and AAL layers are each divided into two sublayers, one at the bottom that does the work and a 
convergence sublayer on top that provides the proper interface to the layer above it. The functions of the layers 
and sublayers are given in Fig. 1-33. 


Figure 1-33. The ATM layers and sublayers, and their functions. 


The PMD (Physical Medium Dependent) sublayer interfaces to the actual cable. It moves the bits on and off and 
handles the bit timing. For different carriers and cables, this sublayer will be different. 


The other sublayer of the physical layer is the TC (Transmission Convergence) sublayer. When cells are 
transmitted, the TC sublayer sends them as a string of bits to the PMD sublayer. Doing this is easy. At the other end, 
the TC sublayer gets a pure incoming bit stream from the PMD sublayer. Its job is to convert this bit stream into 
a cell stream for the ATM layer. It handles all the issues related to telling where cells begin and end in the bit 
stream. In the ATM model, this functionality is in the physical layer. In the OSI model and in pretty much all other 
networks, the job of framing, that is, turning a raw bit stream into a sequence of frames or cells, is the data link 
layer's task. 


As we mentioned earlier, the ATM layer manages cells, including their generation and transport. Most of the 
interesting aspects of ATM are located here. It is a mixture of the OSI data link and network layers; it is not split 
into sublayers. 


The AAL layer is split into a SAR (Segmentation And Reassembly) sublayer and a CS (Convergence Sublayer). 
The lower sublayer breaks up packets into cells on the transmission side and puts them back together again at 
the destination. The upper sublayer makes it possible to have ATM systems offer different kinds of services to 
different applications (e.g., file transfer and video on demand have different requirements concerning error 
handling, timing, etc.). 
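

The job of the lower sublayer can be sketched as follows. The 48-byte payload size comes from the cell format. Everything else is an assumption made for illustration: the function names, the zero padding of the last payload, and the original length being handed to the receiver separately. How a real AAL conveys the length and detects errors is not covered here.

```python
PAYLOAD_SIZE = 48  # payload bytes per cell


def segment(packet):
    """Split a packet into 48-byte payloads, zero-padding the last one.

    Returns (payloads, original_length); the length is needed to strip
    the padding again at the destination.
    """
    payloads = []
    for start in range(0, len(packet), PAYLOAD_SIZE):
        chunk = packet[start:start + PAYLOAD_SIZE]
        payloads.append(chunk.ljust(PAYLOAD_SIZE, b"\x00"))
    return payloads, len(packet)


def reassemble(payloads, original_length):
    """Concatenate in-order payloads and drop the padding."""
    return b"".join(payloads)[:original_length]
```

Reassembly by simple concatenation works only because cells on a virtual circuit are never delivered out of order. A lost cell would still have to be detected by other means.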


As it is probably mostly downhill for ATM from now on, we will not discuss it further in this book. Nevertheless, 
since it has a substantial installed base, it will probably be around for at least a few more years. For more 
information about ATM, see (Dobrowski and Grise, 2001; and Gadecki and Heckart, 1997). 


1.5.3 Ethernet 


Both the Internet and ATM were designed for wide area networking. However, many companies, universities, 
and other organizations have large numbers of computers that must be connected. This need gave rise to the 
local area network. In this section we will say a little bit about the most popular LAN, Ethernet. 


The story starts out in pristine Hawaii in the early 1970s. In this case, "pristine" can be interpreted as "not having 
a working telephone system." While not being interrupted by the phone all day long makes life more pleasant for 
vacationers, it did not make life more pleasant for researcher Norman Abramson and his colleagues at the 
University of Hawaii who were trying to connect users on remote islands to the main computer in Honolulu. 
Stringing their own cables under the Pacific Ocean was not in the cards, so they looked for a different solution. 


The one they found was short-range radios. Each user terminal was equipped with a small radio having two 
frequencies: upstream (to the central computer) and downstream (from the central computer). When the user 
wanted to contact the computer, it just transmitted a packet containing the data in the upstream channel. If no 
one else was transmitting at that instant, the packet probably got through and was acknowledged on the 
downstream channel. If there was contention for the upstream channel, the terminal noticed the lack of 
acknowledgement and tried again. Since there was only one sender on the downstream channel (the central 
computer), there were never collisions there. This system, called ALOHANET, worked fairly well under 
conditions of low traffic but bogged down badly when the upstream traffic was heavy. 
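

Why heavy traffic hurts so much can be seen from a deliberately simplified model. The sketch below assumes that time is divided into slots and that each terminal independently transmits in a given slot with the same probability. Neither assumption comes from the description above, and the function names are ours. A transmission gets through only if no other terminal picks the same slot.

```python
def attempt_success_rate(terminals, p_transmit):
    """Chance that one terminal's transmission meets no competing sender."""
    return (1 - p_transmit) ** (terminals - 1)


def slot_success_probability(terminals, p_transmit):
    """Probability that exactly one terminal transmits in a given slot."""
    return terminals * p_transmit * attempt_success_rate(terminals, p_transmit)
```

With 20 terminals each transmitting in 1 percent of the slots, more than 80 percent of attempts get through. If each terminal instead transmits in half of the slots, nearly every attempt collides and the channel carries almost nothing.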


About the same time, a student named Bob Metcalfe got his bachelor's degree at M.I.T. and then moved up the 
river to get his Ph.D. at Harvard. During his studies, he was exposed to Abramson's work. He became so 
interested in it that after graduating from Harvard, he decided to spend the summer in Hawaii working with 
Abramson before starting work at Xerox PARC (Palo Alto Research Center). When he got to PARC, he saw that 
the researchers there had designed and built what would later be called personal computers. But the machines 
were isolated. Using his knowledge of Abramson's work, he, together with his colleague David Boggs, designed 
and implemented the first local area network (Metcalfe and Boggs, 1976). 


They called the system Ethernet after the luminiferous ether, through which electromagnetic radiation was once 
thought to propagate. (When the 19th century British physicist James Clerk Maxwell discovered that 
electromagnetic radiation could be described by a wave equation, scientists assumed that space must be filled 
with some ethereal medium in which the radiation was propagating. Only after the famous Michelson-Morley 
experiment in 1887 did physicists discover that electromagnetic radiation could propagate in a vacuum.) 


The transmission medium here was not a vacuum, but a thick coaxial cable (the ether) up to 2.5 km long (with 
repeaters every 500 meters). Up to 256 machines could be attached to the system via transceivers screwed onto 
the cable. A cable with multiple machines attached to it in parallel is called a multidrop cable. The system ran at 
2.94 Mbps. A sketch of its architecture is given in Fig. 1-34. Ethernet had a major improvement over 
ALOHANET: before transmitting, a computer first listened to the cable to see if someone else was already 
transmitting. If so, the computer held back until the current transmission finished. Doing so avoided interfering 
with existing transmissions, giving a much higher efficiency. ALOHANET did not work like this because it was 
impossible for a terminal on one island to sense the transmission of a terminal on a distant island. With a single 
cable, this problem does not exist. 


Figure 1-34. Architecture of the original Ethernet. 


Despite the computer listening before transmitting, a problem still arises: what happens if two or more computers 
all wait until the current transmission completes and then all start at once? The solution is to have each 
computer listen during its own transmission and, if it detects interference, jam the ether to alert all senders, then 
back off and wait a random time before retrying. If a second collision happens, the random waiting time is 
doubled, and so on, to spread out the competing transmissions and give one of them a chance to go first. 
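

The doubling rule can be sketched in a few lines. The text says only that the random waiting time doubles after each further collision, so the details below are assumptions: the initial window of 2, the arbitrary time unit, the uniform choice within the window, and the function names.

```python
import random


def backoff_window(collisions):
    """Size of the waiting window after this many successive collisions.

    The window doubles with every additional collision.
    """
    if collisions < 1:
        raise ValueError("backoff applies only after a collision")
    return 2 ** collisions


def pick_wait(collisions, rng=random):
    """Choose a random wait in the range [0, window) for this retry."""
    return rng.randrange(backoff_window(collisions))
```

After one collision a computer waits 0 or 1 time units, after two collisions between 0 and 3, and after three between 0 and 7. As the window grows, it becomes less likely that two colliding computers pick the same wait again.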


The Xerox Ethernet was so successful that DEC, Intel, and Xerox drew up a standard in 1978 for a 10-Mbps 
Ethernet, called the DIX standard. With two minor changes, the DIX standard became the IEEE 802.3 standard 
in 1983. 


Unfortunately for Xerox, it already had a history of making seminal inventions (such as the personal computer) 
and then failing to commercialize them, a story told in Fumbling the Future (Smith and Alexander, 1988). 
When Xerox showed little interest in doing anything with Ethernet other than helping standardize it, Metcalfe 
formed his own company, 3Com, to sell Ethernet adapters for PCs. It has sold over 100 million of them. 


Ethernet continued to develop and is still developing. New versions at 100 Mbps, 1000 Mbps, and still higher 
have come out. Also the cabling has improved, and switching and other features have been added. We will 
discuss Ethernet in detail in Chap. 4. 


In passing, it is worth mentioning that Ethernet (IEEE 802.3) is not the only LAN standard. The committee also 
standardized a token bus (802.4) and a token ring (802.5). The need for three more-or-less incompatible 
standards has little to do with technology and everything to do with politics. At the time of standardization, 
General Motors was pushing a LAN in which the topology was the same as Ethernet (a linear cable) but 
computers took turns in transmitting by passing a short packet called a token from computer to computer. A 
computer could only send if it possessed the token, thus avoiding collisions. General Motors announced that this 
scheme was essential for manufacturing cars and was not prepared to budge from this position. This 
announcement notwithstanding, 802.4 has basically vanished from sight. 
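
As a rough sketch of the token-passing idea, the following Python fragment passes a token around a fixed list of stations, and only the holder may send. The station names, the queue contents, and the one-frame-per-visit rule are assumptions for illustration and are not taken from the 802.4 standard.

```python
from collections import deque

def run_token_passing(queues, rounds):
    """Simulate token passing among stations.

    queues maps a station name to a deque of frames waiting to be sent.
    The token visits stations in a fixed order; the holder sends at most
    one frame per visit (an assumed rule), so two stations never send at once.
    Returns the list of (station, frame) transmissions in order.
    """
    order = list(queues)
    sent = []
    for _ in range(rounds):
        for station in order:          # the token arrives at this station
            if queues[station]:
                sent.append((station, queues[station].popleft()))
            # the token is then passed on to the next station
    return sent

if __name__ == "__main__":
    q = {"A": deque(["a1", "a2"]), "B": deque(), "C": deque(["c1"])}
    for station, frame in run_token_passing(q, rounds=2):
        print(station, "sends", frame)
```

Since transmissions happen only while a station holds the token, the sketch never produces two senders at the same time, which is the property the scheme relies on to avoid collisions.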


Similarly, IBM had its own favorite: its proprietary token ring. The token was passed around the ring and 
whichever computer held the token was allowed to transmit before putting the token back on the ring. Unlike 
802.4, this scheme, standardized as 802.5, is still in use at some IBM sites, but virtually nowhere outside of IBM 
sites. Work is progressing on a gigabit version (802.5v), but it seems unlikely that it will ever catch up 
with Ethernet. In short, there was a war between Ethernet, token bus, and token ring, and Ethernet won, mostly 
because it was there first and the challengers were not as good. 


1.5.4 Wireless LANs: 802.11 


Almost as soon as notebook computers appeared, many people had a dream of walking into an office and 
magically having their notebook computer be connected to the Internet. Consequently, various groups began 
working on ways to accomplish this goal. The most practical approach is to equip both the office and the 
notebook computers with short-range radio transmitters and receivers to allow them to communicate. This work 
rapidly led to wireless LANs being marketed by a variety of companies. 


The trouble was that no two of them were compatible. This proliferation of standards meant that a computer 
equipped with a brand X radio would not work in a room equipped with a brand Y base station. Finally, the 
industry decided that a wireless LAN standard might be a good idea, so the IEEE committee that standardized 
the wired LANs was given the task of drawing up a wireless LAN standard. The standard it came up with was 
named 802.11. A common slang name for it is WiFi. It is an important standard and deserves respect, so we will 
call it by its proper name, 802.11. 


The proposed standard had to work in two modes: 


1. In the presence of a base station. 
2. In the absence of a base station. 


In the former case, all communication was to go through the base station, called an access point in 802.11 
terminology. In the latter case, the computers would just send to one another directly. This mode is now 
sometimes called ad hoc networking. A typical example is two or more people sitting down together in a room 
not equipped with a wireless LAN and having their computers just communicate directly. The two modes are 
illustrated in Fig. 1-35. 


Figure 1-35. (a) Wireless networking with a base station. (b) Ad hoc networking. 


The first decision was the easiest: what to call it. All the other LAN standards had numbers like 802.1, 802.2, 
802.3, up to 802.10, so the wireless LAN standard was dubbed 802.11. The rest was harder. 


In particular, some of the many challenges that had to be met were: finding a suitable frequency band that was 
available, preferably worldwide; dealing with the fact that radio signals have a finite range; ensuring that users' 
privacy was maintained; taking limited battery life into account; worrying about human safety (do radio waves 
cause cancer?); understanding the implications of computer mobility; and finally, building a system with enough 
bandwidth to be economically viable. 


At the time the standardization process started (mid-1990s), Ethernet had already come to dominate local area 
networking, so the committee decided to make 802.11 compatible with Ethernet above the data link layer. In 
particular, it should be possible to send an IP packet over the wireless LAN the same way a wired computer sent 
an IP packet over Ethernet. Nevertheless, in the physical and data link layers, several inherent differences with 
Ethernet exist and had to be dealt with by the standard. 


First, a computer on Ethernet always listens to the ether before transmitting. Only if the ether is idle does the 
computer begin transmitting. With wireless LANs, that idea does not work so well. To see why, examine Fig. 1-36. 
Suppose that computer A is transmitting to computer B, but the radio range of A's transmitter is too short to 
reach computer C. If C wants to transmit to B it can listen to the ether before starting, but the fact that it does not 
hear anything does not mean that its transmission will succeed. The 802.11 standard had to solve this problem. 
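
The situation with A, B, and C can be made concrete with a small Python sketch. The positions along a line, the common radio range, and the function names are all hypothetical values chosen only to reproduce the scenario described above.

```python
def can_hear(sender_pos, listener_pos, radio_range):
    """True if the listener is within the sender's radio range (1-D layout)."""
    return abs(sender_pos - listener_pos) <= radio_range

# Hypothetical positions in meters along a line, and an assumed common range.
POS = {"A": 0.0, "B": 10.0, "C": 20.0}
RANGE = 12.0

def carrier_sense_is_misleading(sender, receiver, other):
    """True if 'other' senses an idle channel even though 'sender' is
    transmitting to 'receiver' and both signals would overlap at 'receiver'."""
    other_hears_sender = can_hear(POS[sender], POS[other], RANGE)
    sender_reaches_receiver = can_hear(POS[sender], POS[receiver], RANGE)
    other_reaches_receiver = can_hear(POS[other], POS[receiver], RANGE)
    return ((not other_hears_sender)
            and sender_reaches_receiver
            and other_reaches_receiver)

if __name__ == "__main__":
    print(carrier_sense_is_misleading("A", "B", "C"))
```

With these numbers C cannot hear A, yet both A and C reach B, so listening first tells C nothing useful about whether B is busy receiving.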


Figure 1-36. The range of a single radio may not cover the entire system. 


The second problem that had to be solved is that a radio signal can be reflected off solid objects, so it may be 
received multiple times (along multiple paths). This interference results in what is called multipath fading. 


The third problem is that a great deal of software is not aware of mobility. For example, many word processors 
have a list of printers that users can choose from to print a file. When the computer on which the word processor 
runs is taken into a new environment, the built-in list of printers becomes invalid. 


The fourth problem is that if a notebook computer is moved away from the ceiling-mounted base station it is 
using and into the range of a different base station, some way of handing it off is needed. Although this problem 
occurs with cellular telephones, it does not occur with Ethernet and needed to be solved. In particular, the 
network envisioned consists of multiple cells, each with its own base station, but with the base stations 
connected by Ethernet, as shown in Fig. 1-37. From the outside, the entire system should look like a single 
Ethernet. The connection between the 802.11 system and the outside world is called a portal. 


Figure 1-37. A multicell 802.11 network. 


After some work, the committee came up with a standard in 1997 that addressed these and other concerns. The 
wireless LAN it described ran at either 1 Mbps or 2 Mbps. Almost immediately, people complained that it was too 
slow, so work began on faster standards. A split developed within the committee, resulting in two new standards 
in 1999. The 802.11a standard uses a wider frequency band and runs at speeds up to 54 Mbps. The 802.11b 
standard uses the same frequency band as 802.11, but uses a different modulation technique to achieve 11 
Mbps. Some people see this as psychologically important since 11 Mbps is faster than the original wired 
Ethernet. It is likely that the original 1-Mbps 802.11 will die off quickly, but it is not yet clear which of the new 
standards will win out. 


To make matters even more complicated than they already were, the 802 committee has come up with yet 
another variant, 802.11g, which uses the modulation technique of 802.11a but the frequency band of 802.11b. 
We will come back to 802.11 in detail in Chap. 4. 


That 802.11 is going to cause a revolution in computing and Internet access is now beyond any doubt. Airports, 
train stations, hotels, shopping malls, and universities are rapidly installing it. Even upscale coffee shops are 
installing 802.11 so that the assembled yuppies can surf the Web while drinking their lattes. It is likely that 
802.11 will do to the Internet what notebook computers did to computing: make it mobile. 


Chapter 2. The Physical Layer 


2.1 Guided Transmission Media 


The purpose of the physical layer is to transport a raw bit stream from one machine to another. Various physical 
media can be used for the actual transmission. Each one has its own niche in terms of bandwidth, delay, cost, 
and ease of installation and maintenance. Media are roughly grouped into guided media, such as copper wire 
and fiber optics, and unguided media, such as radio and lasers through the air. We will look at all of these in the 
following sections. 


2.1.1 Magnetic Media 


One of the most common ways to transport data from one computer to another is to write them onto magnetic 
tape or removable media (e.g., recordable DVDs), physically transport the tape or disks to the destination 
machine, and read them back in again. Although this method is not as sophisticated as using a geosynchronous 
communication satellite, it is often more cost effective, especially for applications in which high bandwidth or cost 
per bit transported is the key factor. 


A simple calculation will make this point clear. An industry standard Ultrium tape can hold 200 gigabytes. A box 
60 x 60 x 60 cm can hold about 1000 of these tapes, for a total capacity of 200 terabytes, or 1600 terabits (1.6 
petabits). A box of tapes can be delivered anywhere in the United States in 24 hours by Federal Express and 
other companies. The effective bandwidth of this transmission is 1600 terabits/86,400 sec, or 19 Gbps. If the 
destination is only an hour away by road, the bandwidth is increased to over 400 Gbps. No computer network 
can even approach this. 
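
The arithmetic behind these figures is easy to reproduce. The short Python sketch below uses only the numbers quoted in the paragraph; the function and constant names are made up for the example.

```python
TAPE_BYTES = 200e9        # one Ultrium tape: 200 gigabytes
TAPES_PER_BOX = 1000      # tapes in a 60 x 60 x 60 cm box

def box_bandwidth_bps(delivery_seconds):
    """Effective bandwidth in bits/sec of shipping one box of tapes."""
    total_bits = TAPE_BYTES * TAPES_PER_BOX * 8
    return total_bits / delivery_seconds

if __name__ == "__main__":
    print(f"24-hour delivery: {box_bandwidth_bps(86_400) / 1e9:.1f} Gbps")
    print(f"1-hour drive:     {box_bandwidth_bps(3_600) / 1e9:.1f} Gbps")
```

The 24-hour case works out to about 18.5 Gbps, which the text rounds to 19 Gbps, and the one-hour case to about 444 Gbps.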


For a bank with many gigabytes of data to be backed up daily on a second machine (so the bank can continue to 
function even in the face of a major flood or earthquake), it is likely that no other transmission technology can 
even begin to approach magnetic tape for performance. Of course, networks are getting faster, but tape 
densities are increasing, too. 


If we now look at cost, we get a similar picture. The cost of an Ultrium tape is around $40 when bought in bulk. A 
tape can be reused at least ten times, so the tape cost is maybe $4000 per box per usage. Add to this another 
$1000 for shipping (probably much less), and we have a cost of roughly $5000 to ship 200 TB. This amounts to 
shipping a gigabyte for under 3 cents. No network can beat that. The moral of the story is: 


Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. 
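
The cost estimate that leads up to this moral can be checked with a few lines of Python. All default values come from the paragraph above; the function and parameter names are invented for the sketch.

```python
def cost_per_gigabyte(tape_price=40.0, tapes=1000, reuses=10,
                      shipping=1000.0, tape_gigabytes=200):
    """Cost in dollars of shipping one gigabyte in a box of tapes."""
    tape_cost_per_use = tape_price * tapes / reuses   # $4000 per box per use
    total_cost = tape_cost_per_use + shipping          # about $5000
    return total_cost / (tapes * tape_gigabytes)       # 200,000 GB per box

if __name__ == "__main__":
    print(f"{cost_per_gigabyte() * 100:.1f} cents per gigabyte")
```

With the quoted figures the result is 2.5 cents per gigabyte, consistent with the "under 3 cents" claim.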


2.1.2 Twisted Pair 


Although the bandwidth characteristics of magnetic tape are excellent, the delay characteristics are poor. 
Transmission time is measured in minutes or hours, not milliseconds. For many applications an on-line 
connection is needed. One of the oldest and still most common transmission media is twisted pair. A twisted pair 
consists of two insulated copper wires, typically about 1 mm thick. The wires are twisted together in a helical 
form, just like a DNA molecule. Twisting is done because two parallel wires constitute a fine antenna. When the 
wires are twisted, the waves from different twists cancel out, so the wire radiates less effectively. 


The most common application of the twisted pair is the telephone system. Nearly all telephones are connected 
to the telephone company (telco) office by a twisted pair. Twisted pairs can run several kilometers without 
amplification, but for longer distances, repeaters are needed. When many twisted pairs run in parallel for a 
substantial distance, such as all the wires coming from an apartment building to the telephone company office, 
they are bundled together and encased in a protective sheath. The pairs in these bundles would interfere with 
one another if it were not for the twisting. In parts of the world where telephone lines run on poles above ground, 
it is common to see bundles several centimeters in diameter. 


Twisted pairs can be used for transmitting either analog or digital signals. The bandwidth depends on the 
thickness of the wire and the distance traveled, but several megabits/sec can be achieved for a few kilometers in 
many cases. Due to their adequate performance and low cost, twisted pairs are widely used and are likely to 
remain so for years to come. 


Twisted pair cabling comes in several varieties, two of which are important for computer networks. Category 3 
twisted pairs consist of two insulated wires gently twisted together. Four such pairs are typically grouped in a 
plastic sheath to protect the wires and keep them together. Prior to about 1988, most office buildings had one 
category 3 cable running from a central wiring closet on each floor into each office. This scheme allowed up to 
four regular telephones or two multiline telephones in each office to connect to the telephone company 
equipment in the wiring closet. 


Starting around 1988, the more advanced category 5 twisted pairs were introduced. They are similar to category 
3 pairs, but with more twists per centimeter, which results in less crosstalk and a better-quality signal over longer 
distances, making them more suitable for high-speed computer communication. Up-and-coming categories are 6 
and 7, which are capable of handling signals with bandwidths of 250 MHz and 600 MHz, respectively (versus a 
mere 16 MHz and 100 MHz for categories 3 and 5, respectively). 


All of these wiring types are often referred to as UTP (Unshielded Twisted Pair), to contrast them with the bulky, 
expensive, shielded twisted pair cables IBM introduced in the early 1980s, but which have not proven popular 
outside of IBM installations. Twisted pair cabling is illustrated in Fig. 2-3. 


Figure 2-3. (a) Category 3 UTP. (b) Category 5 UTP. 


2.1.3 Coaxial Cable 


Another common transmission medium is the coaxial cable (known to its many friends as just "coax" and 
pronounced "co-ax"). It has better shielding than twisted pairs, so it can span longer distances at higher speeds. 
Two kinds of coaxial cable are widely used. One kind, 50-ohm cable, is commonly used when it is intended for 
digital transmission from the start. The other kind, 75-ohm cable, is commonly used for analog transmission and 
cable television but is becoming more important with the advent of Internet over cable. This distinction is based 
on historical, rather than technical, factors (e.g., early dipole antennas had an impedance of 300 ohms, and it 
was easy to use existing 4:1 impedance matching transformers). 


A coaxial cable consists of a stiff copper wire as the core, surrounded by an insulating material. The insulator is 
encased by a cylindrical conductor, often as a closely-woven braided mesh. The outer conductor is covered in a 
protective plastic sheath. A cutaway view of a coaxial cable is shown in Fig. 2-4. 


Figure 2-4. A coaxial cable. 


The construction and shielding of the coaxial cable give it a good combination of high bandwidth and excellent 
noise immunity. The bandwidth possible depends on the cable quality, length, and signal-to-noise ratio of the 
data signal. Modern cables have a bandwidth of close to 1 GHz. Coaxial cables used to be widely used within 
the telephone system for long-distance lines but have now largely been replaced by fiber optics on long-haul 
routes. Coax is still widely used for cable television and metropolitan area networks, however. 


2.1.4 Fiber Optics 


Many people in the computer industry take enormous pride in how fast computer technology is improving. The 
original (1981) IBM PC ran at a clock speed of 4.77 MHz. Twenty years later, PCs could run at 2 GHz, a gain of 
a factor of 20 per decade. Not too bad. 


In the same period, wide area data communication went from 56 kbps (the ARPANET) to 1 Gbps (modern 
optical communication), a gain of more than a factor of 125 per decade, while at the same time the error rate 
went from 10^-5 per bit to almost zero. 
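
The per-decade factors quoted for computing and communication follow from assuming steady exponential growth over the 20-year span. A minimal Python sketch of that calculation, with an illustrative function name, is:

```python
def gain_per_decade(start, end, years):
    """Average multiplicative gain per 10 years, assuming steady exponential growth."""
    return (end / start) ** (10.0 / years)

if __name__ == "__main__":
    cpu = gain_per_decade(4.77e6, 2e9, 20)   # clock rate in Hz
    net = gain_per_decade(56e3, 1e9, 20)     # line rate in bits/sec
    print(f"CPU clock: x{cpu:.1f} per decade")
    print(f"WAN speed: x{net:.1f} per decade")
```

The clock-rate gain comes out at a little over 20 per decade and the line-rate gain at a little over 130 per decade, matching the "factor of 20" and "more than a factor of 125" figures.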


Furthermore, single CPUs are beginning to approach physical limits, such as speed of light and heat dissipation 
problems. In contrast, with current fiber technology, the achievable bandwidth is certainly in excess of 50,000 
Gbps (50 Tbps) and many people are looking very hard for better technologies and materials. The current 
practical signaling limit of about 10 Gbps is due to our inability to convert between electrical and optical signals 
any faster, although in the laboratory, 100 Gbps has been achieved on a single fiber. 


In the race between computing and communication, communication won. The full implications of essentially 
infinite bandwidth (although not at zero cost) have not yet sunk in to a generation of computer scientists and 
engineers taught to think in terms of the low Nyquist and Shannon limits imposed by copper wire. The new 
conventional wisdom should be that all computers are hopelessly slow and that networks should try to avoid 
computation at all costs, no matter how much bandwidth that wastes. In this section we will study fiber optics to 
see how that transmission technology works. 


An optical transmission system has three key components: the light source, the transmission medium, and the 
detector. Conventionally, a pulse of light indicates a 1 bit and the absence of light indicates a 0 bit. The 
transmission medium is an ultra-thin fiber of glass. The detector generates an electrical pulse when light falls on 
it. By attaching a light source to one end of an optical fiber and a detector to the other, we have a unidirectional 
data transmission system that accepts an electrical signal, converts and transmits it by light pulses, and then 
reconverts the output to an electrical signal at the receiving end. 


This transmission system would leak light and be useless in practice except for an interesting principle of 
physics. When a light ray passes from one medium to another, for example, from fused silica to air, the ray is 
refracted (bent) at the silica/air boundary, as shown in Fig. 2-5(a). Here we see a light ray incident on the 
boundary at an angle α1 emerging at an angle β1. The amount of refraction depends on the properties of the two 
media (in particular, their indices of refraction). For angles of incidence above a certain critical value, the light is 
refracted back into the silica; none of it escapes into the air. Thus, a light ray incident at or above the critical 
angle is trapped inside the fiber, as shown in Fig. 2-5(b), and can propagate for many kilometers with virtually no 
loss. 
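
The critical angle follows from Snell's law: it is the angle of incidence, measured from the normal, whose sine equals the ratio of the outer index of refraction to the inner one. The Python sketch below is illustrative only; the index values of 1.46 for fused silica and 1.00 for air are assumed, approximate figures that do not appear in the text.

```python
import math

def critical_angle_deg(n_inside, n_outside):
    """Critical angle of incidence, in degrees from the normal, from Snell's law.

    Total internal reflection needs n_inside > n_outside.
    """
    if n_inside <= n_outside:
        raise ValueError("no total internal reflection unless n_inside > n_outside")
    return math.degrees(math.asin(n_outside / n_inside))

if __name__ == "__main__":
    # 1.46 for fused silica and 1.00 for air are assumed, approximate values.
    print(f"{critical_angle_deg(1.46, 1.00):.1f} degrees")
```

Under these assumptions the critical angle is roughly 43 degrees, so any ray striking the boundary more obliquely than that stays inside the silica.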


Figure 2-5. (a) Three examples of a light ray from inside a silica fiber impinging on the air/silica boundary 
at different angles. (b) Light trapped by total internal reflection. 


The sketch of Fig. 2-5(b) shows only one trapped ray, but since any light ray incident on the boundary above the 
critical angle will be reflected internally, many different rays will be bouncing around at different angles. Each ray 
is said to have a different mode, so a fiber having this property is called a multimode fiber. 


However, if the fiber's diameter is reduced to a few wavelengths of light, the fiber acts like a wave guide, and the 
light can propagate only in a straight line, without bouncing, yielding a single-mode fiber. Single-mode fibers are 
more expensive but are widely used for longer distances. Currently available single-mode fibers can transmit 
data at 50 Gbps for 100 km without amplification. Even higher data rates have been achieved in the laboratory 
for shorter distances. 


Transmission of Light through Fiber 


Optical fibers are made of glass, which, in turn, is made from sand, an inexpensive raw material available in 
unlimited amounts. Glassmaking was known to the ancient Egyptians, but their glass had to be no more than 1 
mm thick or the light could not shine through. Glass transparent enough to be useful for windows was developed 
during the Renaissance. The glass used for modern optical fibers is so transparent that if the oceans were full of 
it instead of water, the seabed would be as visible from the surface as the ground is from an airplane on a clear 
day. 


The attenuation of light through glass depends on the wavelength of the light (as well as on some physical 
properties of the glass). For the kind of glass used in fibers, the attenuation is shown in Fig. 2-6 in decibels per 
linear kilometer of fiber. The attenuation in decibels is given by the formula 

    attenuation in decibels = 10 log10 (transmitted power / received power) 


Figure 2-6. Attenuation of light through fiber in the infrared region. 


For example, a factor of two loss gives an attenuation of 10 log10 2 = 3 dB. The figure shows the near infrared 
part of the spectrum, which is what is used in practice. Visible light has slightly shorter wavelengths, from 0.4 to 
0.7 microns (1 micron is 10^-6 meters). The true metric purist would refer to these wavelengths as 400 nm to 700 
nm, but we will stick with traditional usage. 
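
The decibel arithmetic can be tried out directly. The Python sketch below computes attenuation as 10 log10 of the ratio of transmitted to received power and converts a decibel figure back into the fraction of power lost. The function names and the 0.2 dB/km sample value are assumptions for the example, not figures from the text.

```python
import math

def attenuation_db(transmitted_power, received_power):
    """Attenuation in decibels: 10 log10(transmitted / received)."""
    return 10 * math.log10(transmitted_power / received_power)

def fraction_lost(db):
    """Fraction of the power lost for a given attenuation in decibels."""
    return 1 - 10 ** (-db / 10)

if __name__ == "__main__":
    print(f"factor-of-two loss: {attenuation_db(2.0, 1.0):.2f} dB")
    # 0.2 dB/km is an assumed example value, not a figure from the text.
    print(f"0.2 dB/km loses {fraction_lost(0.2) * 100:.1f}% per km")
```

A factor-of-two loss gives about 3.01 dB, and an attenuation of 0.2 dB per kilometer corresponds to losing roughly 4.5 percent of the power in each kilometer.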


Three wavelength bands are used for optical communication. They are centered at 0.85, 1.30, and 1.55 microns, 
respectively. The last two have good attenuation properties (less than 5 percent loss per kilometer). The 0.85 
micron band has higher attenuation, but at that wavelength the lasers and electronics can be made from the 
same material (gallium arsenide). All three bands are 25,000 to 30,000 GHz wide. 


Light pulses sent down a fiber spread out in length as they propagate. This spreading is called chromatic 
dispersion. The amount of it is wavelength dependent. One way to keep these spread-out pulses from 
overlapping is to increase the distance between them, but this can be done only by reducing the signaling rate. 


Fortunately, it has been discovered that by making the pulses in a special shape related to the reciprocal of the 
hyperbolic cosine, nearly all the dispersion effects cancel out, and it is possible to send pulses for thousands of 
kilometers without appreciable shape distortion. These pulses are called solitons. A considerable amount of 
research is going on to take solitons out of the lab and into the field. 


Fiber Cables 


Fiber optic cables are similar to coax, except without the braid. Figure 2-7(a) shows a single fiber viewed from 
the side. At the center is the glass core through which the light propagates. In multimode fibers, the core is 
typically 50 microns in diameter, about the thickness of a human hair. In single-mode fibers, the core is 8 to 10 
microns. 


Figure 2-7. (a) Side view of a single fiber. (b) End view of a sheath with three fibers. 


The core is surrounded by a glass cladding with a lower index of refraction than the core, to keep all the light in 
the core. Next comes a thin plastic jacket to protect the cladding. Fibers are typically grouped in bundles, 
protected by an outer sheath. Figure 2-7(b) shows a sheath with three fibers. 


Terrestrial fiber sheaths are normally laid in the ground within a meter of the surface, where they are 
occasionally subject to attacks by backhoes or gophers. Near the shore, transoceanic fiber sheaths are buried in 
trenches by a kind of seaplow. In deep water, they just lie on the bottom, where they can be snagged by fishing 
trawlers or attacked by giant squid. 


Fibers can be connected in three different ways. First, they can terminate in connectors and be plugged into fiber 
sockets. Connectors lose about 10 to 20 percent of the light, but they make it easy to reconfigure systems. 


Second, they can be spliced mechanically. Mechanical splices just lay the two carefully-cut ends next to each 
other in a special sleeve and clamp them in place. Alignment can be improved by passing light through the 
junction and then making small adjustments to maximize the signal. Mechanical splices take trained personnel 
about 5 minutes and result in a 10 percent light loss. 


Third, two pieces of fiber can be fused (melted) to form a solid connection. A fusion splice is almost as good as a 
single drawn fiber, but even here, a small amount of attenuation occurs. 


For all three kinds of splices, reflections can occur at the point of the splice, and the reflected energy can 
interfere with the signal. 


Two kinds of light sources are typically used to do the signaling, LEDs (Light Emitting Diodes) and 
semiconductor lasers. They have different properties, as shown in Fig. 2-8. They can be tuned in wavelength by 
inserting Fabry-Perot or Mach-Zehnder interferometers between the source and the fiber. Fabry-Perot 
interferometers are simple resonant cavities consisting of two parallel mirrors. The light is incident perpendicular 
to the mirrors. The length of the cavity selects out those wavelengths that fit inside an integral number of times. 
Mach-Zehnder interferometers separate the light into two beams. The two beams travel slightly different 
distances. They are recombined at the end and are in phase for only certain wavelengths. 


Figure 2-8. A comparison of semiconductor diodes and LEDs as light sources. 


Item                      LED          Semiconductor laser 
Data rate                 Low          High 
Fiber type                Multimode    Multimode or single mode 
Distance                  Short        Long 
Lifetime                  Long life    Short life 
Temperature sensitivity   Minor        Substantial 
Cost                      Low cost     Expensive 


The receiving end of an optical fiber consists of a photodiode, which gives off an electrical pulse when struck by 
light. The typical response time of a photodiode is 1 nsec, which limits data rates to about 1 Gbps. Thermal noise 
is also an issue, so a pulse of light must carry enough energy to be detected. By making the pulses powerful 
enough, the error rate can be made arbitrarily small. 


Fiber Optic Networks 


Fiber optics can be used for LANs as well as for long-haul transmission, although tapping into it is more complex 
than connecting to an Ethernet. One way around the problem is to realize that a ring network is really just a 
collection of point-to-point links, as shown in Fig. 2-9. The interface at each computer passes the light pulse 
stream through to the next link and also serves as a T junction to allow the computer to send and accept 
messages. 


Figure 2-9. A fiber optic ring with active repeaters. 


Two types of interfaces are used. A passive interface consists of two taps fused onto the main fiber. One tap has 
an LED or laser diode at the end of it (for transmitting), and the other has a photodiode (for receiving). The tap 
itself is completely passive and is thus extremely reliable because a broken LED or photodiode does not break 
the ring. It just takes one computer off-line. 


The other interface type, shown in Fig. 2-9, is the active repeater. The incoming light is converted to an electrical 
signal, regenerated to full strength if it has been weakened, and retransmitted as light. The interface with the 
computer is an ordinary copper wire that comes into the signal regenerator. Purely optical repeaters are now 
being used, too. These devices do not require the optical to electrical to optical conversions, which means they 
can operate at extremely high bandwidths. 


If an active repeater fails, the ring is broken and the network goes down. On the other hand, since the signal is 
regenerated at each interface, the individual computer-to-computer links can be kilometers long, with virtually no 
limit on the total size of the ring. The passive interfaces lose light at each junction, so the number of computers 
and total ring length are greatly restricted. 


A ring topology is not the only way to build a LAN using fiber optics. It is also possible to have hardware 
broadcasting by using the passive star construction of Fig. 2-10. In this design, each interface has a fiber 
running from its transmitter to a silica cylinder, with the incoming fibers fused to one end of the cylinder. 
Similarly, fibers fused to the other end of the cylinder are run to each of the receivers. Whenever an interface 
emits a light pulse, it is diffused inside the passive star to illuminate all the receivers, thus achieving broadcast. 


In effect, the passive star combines all the incoming signals and transmits the merged result on all lines. Since 
the incoming energy is divided among all the outgoing lines, the number of nodes in the network is limited by the 
sensitivity of the photodiodes. 
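
The limit on the number of nodes can be estimated with a simple power budget. The Python sketch below assumes an ideal, lossless star that divides the incoming power evenly among its outputs; real stars have extra losses, and the function names and the 20-dB budget are hypothetical.

```python
import math

def splitting_loss_db(outputs):
    """Loss in dB seen by each receiver when power is divided evenly
    among the given number of output fibers (ideal, lossless star assumed)."""
    return 10 * math.log10(outputs)

def max_nodes(power_budget_db):
    """Largest number of outputs whose splitting loss fits the budget,
    where the budget is transmit power minus receiver sensitivity, in dB."""
    # The small epsilon guards against floating-point results just below an integer.
    return math.floor(10 ** (power_budget_db / 10) + 1e-9)

if __name__ == "__main__":
    budget = 20.0   # assumed margin between transmitter and photodiode, in dB
    print("splitting loss for 100 outputs:", splitting_loss_db(100), "dB")
    print("nodes supported by a", budget, "dB budget:", max_nodes(budget))
```

Under these assumptions every doubling of the number of nodes costs about 3 dB, so a more sensitive photodiode or a stronger source translates directly into a larger star.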


Figure 2-10. A passive star connection in a fiber optics network. 


Comparison of Fiber Optics and Copper Wire 


It is instructive to compare fiber to copper. Fiber has many advantages. To start with, it can handle much higher 
bandwidths than copper. This alone would require its use in high-end networks. Due to the low attenuation, 
repeaters are needed only about every 50 km on long lines, versus about every 5 km for copper, a substantial 
cost saving. Fiber also has the advantage of not being affected by power surges, electromagnetic interference, 
or power failures. Nor is it affected by corrosive chemicals in the air, making it ideal for harsh factory 
environments. 


Oddly enough, telephone companies like fiber for a different reason: it is thin and lightweight. Many existing 
cable ducts are completely full, so there is no room to add new capacity. Removing all the copper and replacing 
it by fiber empties the ducts, and the copper has excellent resale value to copper refiners who see it as very high 
grade ore. Also, fiber is much lighter than copper. One thousand twisted pairs 1 km long weigh 8000 kg. Two 
fibers have more capacity and weigh only 100 kg, which greatly reduces the need for expensive mechanical 
support systems that must be maintained. For new routes, fiber wins hands down due to its much lower 
installation cost. 


Finally, fibers do not leak light and are quite difficult to tap. These properties give fiber excellent security against 
potential wiretappers. 


On the downside, fiber is a less familiar technology requiring skills not all engineers have, and fibers can be 
damaged easily by being bent too much. Since optical transmission is inherently unidirectional, two-way 
communication requires either two fibers or two frequency bands on one fiber. Finally, fiber interfaces cost more 
than electrical interfaces. Nevertheless, the future of all fixed data communication for distances of more than a 
few meters is clearly with fiber. For a discussion of all aspects of fiber optics and their networks, see (Hecht, 
2001). 


2.2 Wireless Transmission 


Our age has given rise to information junkies: people who need to be on-line all the time. For these mobile 
users, twisted pair, coax, and fiber optics are of no use. They need to get their hits of data for their laptop,
notebook, shirt pocket, palmtop, or wristwatch computers without being tethered to the terrestrial communication
infrastructure. For these users, wireless communication is the answer. In the following sections, we will look at 
wireless communication in general, as it has many other important applications besides providing connectivity to 
users who want to surf the Web from the beach. 


Some people believe that the future holds only two kinds of communication: fiber and wireless. All fixed (i.e., 
nonmobile) computers, telephones, faxes, and so on will use fiber, and all mobile ones will use wireless. 


Wireless has advantages for even fixed devices in some circumstances. For example, if running a fiber to a 
building is difficult due to the terrain (mountains, jungles, swamps, etc.), wireless may be better. It is noteworthy 
that modern wireless digital communication began in the Hawaiian Islands, where large chunks of Pacific Ocean 
separated the users and the telephone system was inadequate. 


2.2.1 The Electromagnetic Spectrum 


When electrons move, they create electromagnetic waves that can propagate through space (even in a 
vacuum). These waves were predicted by the British physicist James Clerk Maxwell in 1865 and first observed 
by the German physicist Heinrich Hertz in 1887. The number of oscillations per second of a wave is called its 
frequency, f, and is measured in Hz (in honor of Heinrich Hertz). The distance between two consecutive maxima 
(or minima) is called the wavelength, which is universally designated by the Greek letter λ (lambda).


When an antenna of the appropriate size is attached to an electrical circuit, the electromagnetic waves can be 
broadcast efficiently and received by a receiver some distance away. All wireless communication is based on 
this principle. 


In vacuum, all electromagnetic waves travel at the same speed, no matter what their frequency. This speed, 
usually called the speed of light, c, is approximately 3 × 10^8 m/sec, or about 1 foot (30 cm) per nanosecond. (A
case could be made for redefining the foot as the distance light travels in a vacuum in 1 nsec rather than basing 
it on the shoe size of some long-dead king.) In copper or fiber the speed slows to about 2/3 of this value and 
becomes slightly frequency dependent. The speed of light is the ultimate speed limit. No object or signal can 
ever move faster than it. 


The fundamental relation between f, λ, and c (in vacuum) is

$$\lambda f = c \tag{2-2}$$


Since c is a constant, if we know f, we can find λ, and vice versa. As a rule of thumb, when λ is in meters and f is
in MHz, λf ≈ 300. For example, 100-MHz waves are about 3 meters long, 1000-MHz waves are 0.3 meters long,
and 0.1-meter waves have a frequency of 3000 MHz.
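
The rule of thumb can be checked with a few lines of Python. This is only an illustrative sketch: the function names are invented for this example, and c is taken as the rounded value 3 × 10^8 m/sec used in the text.

```python
C = 3.0e8  # approximate speed of light in vacuum, m/sec (rounded, as in the text)

def wavelength_m(freq_hz):
    """Wavelength in meters for a given frequency in Hz (lambda = c / f)."""
    return C / freq_hz

def frequency_hz(wavelength):
    """Frequency in Hz for a given wavelength in meters (f = c / lambda)."""
    return C / wavelength

# The three examples from the text.
for mhz in (100, 1000):
    print(f"{mhz} MHz -> {wavelength_m(mhz * 1e6):.1f} m")
print(f"0.1 m -> {frequency_hz(0.1) / 1e6:.0f} MHz")
```

The product of the wavelength in meters and the frequency in MHz comes out as 300 in each case, which is just c expressed in units of 10^6 m/sec.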


The electromagnetic spectrum is shown in Fig. 2-11. The radio, microwave, infrared, and visible light portions of 
the spectrum can all be used for transmitting information by modulating the amplitude, frequency, or phase of 
the waves. Ultraviolet light, X-rays, and gamma rays would be even better, due to their higher frequencies, but 
they are hard to produce and modulate, do not propagate well through buildings, and are dangerous to living 
things. The bands listed at the bottom of Fig. 2-11 are the official ITU names and are based on the wavelengths, 
so the LF band goes from 1 km to 10 km (approximately 30 kHz to 300 kHz). The terms LF, MF, and HF refer to 
low, medium, and high frequency, respectively. Clearly, when the names were assigned, nobody expected to go 
above 10 MHz, so the higher bands were later named the Very, Ultra, Super, Extremely, and Tremendously High 
Frequency bands. Beyond that there are no names, but Incredibly, Astonishingly, and Prodigiously high
frequency (IHF, AHF, and PHF) would sound nice. 


Figure 2-11. The electromagnetic spectrum and its uses for communication. 
The amount of information that an electromagnetic wave can carry is related to its bandwidth. With current
technology, it is possible to encode a few bits per Hertz at low frequencies, but often as many as 8 at high 
frequencies, so a coaxial cable with a 750 MHz bandwidth can carry several gigabits/sec. From Fig. 2-11 it 
should now be obvious why networking people like fiber optics so much. 


If we solve Eq. (2-2) for f and differentiate with respect to λ, we get

$$\frac{df}{d\lambda} = -\frac{c}{\lambda^2}$$

If we now go to finite differences instead of differentials and only look at absolute values, we get

$$\Delta f = \frac{c\,\Delta\lambda}{\lambda^2} \tag{2-3}$$


Thus, given the width of a wavelength band, Δλ, we can compute the corresponding frequency band, Δf, and
from that the data rate the band can produce. The wider the band, the higher the data rate. As an example,
consider the 1.30-micron band of Fig. 2-6. Here we have λ = 1.3 × 10^-6 and Δλ = 0.17 × 10^-6, so Δf is about
30 THz. At, say, 8 bits/Hz, we get 240 Tbps.
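
The arithmetic of this example can be reproduced as follows. The sketch assumes the rounded value c = 3 × 10^8 m/sec; the function name is made up for illustration, and the 8 bits/Hz figure is the "say" value from the text, not a measured one.

```python
C = 3.0e8  # approximate speed of light in vacuum, m/sec

def frequency_band_hz(wavelength_m, delta_wavelength_m):
    """Width of the frequency band, delta_f = c * delta_lambda / lambda**2."""
    return C * delta_wavelength_m / wavelength_m ** 2

# The 1.30-micron band: lambda = 1.3e-6 m, delta_lambda = 0.17e-6 m.
delta_f = frequency_band_hz(1.3e-6, 0.17e-6)
data_rate = 8 * delta_f  # assuming 8 bits/Hz, as in the text

print(f"delta_f   is roughly {delta_f / 1e12:.0f} THz")
print(f"data rate is roughly {data_rate / 1e12:.0f} Tbps")
```

The unrounded results are slightly above 30 THz and 240 Tbps; the text rounds Δf to 30 THz before multiplying by 8.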


Most transmissions use a narrow frequency band (i.e., Δf/f ≪ 1) to get the best reception (many watts/Hz).
However, in some cases, a wide band is used, with two variations. In frequency hopping spread spectrum, the 
transmitter hops from frequency to frequency hundreds of times per second. It is popular for military 
communication because it makes transmissions hard to detect and next to impossible to jam. It also offers good 
resistance to multipath fading because the direct signal always arrives at the receiver first. Reflected signals 
follow a longer path and arrive later. By then the receiver may have changed frequency and no longer accepts 
signals on the previous frequency, thus eliminating interference between the direct and reflected signals. In 
recent years, this technique has also been applied commercially—both 802.11 and Bluetooth use it, for example. 
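
To make the idea of hopping concrete, here is a minimal sketch of how a transmitter and receiver might stay in step: both derive the same pseudorandom channel sequence from a shared secret seed. The channel list, seed, and function name are all hypothetical, and Python's general-purpose `random` module stands in for whatever sequence generator a real system would use; this is not how any particular standard defines its hop pattern.

```python
import random

def hop_sequence(shared_seed, channels_mhz, hops):
    """Return the list of channels visited, derived from a shared seed."""
    rng = random.Random(shared_seed)  # private generator, same seed -> same sequence
    return [rng.choice(channels_mhz) for _ in range(hops)]

# Hypothetical channel plan: 79 channels spaced 1 MHz apart.
channels = [2402 + k for k in range(79)]

transmitter = hop_sequence(shared_seed=1234, channels_mhz=channels, hops=8)
receiver = hop_sequence(shared_seed=1234, channels_mhz=channels, hops=8)

# Both ends visit the same channels in the same order.
print(transmitter == receiver)
```

A listener who does not know the seed cannot predict the next channel, which is what makes the transmission hard to detect or jam. A delayed reflection arriving on a channel the receiver has already left is simply not heard.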
As a curious footnote, the technique was co-invented by the Austrian-born sex goddess Hedy Lamarr, the first
woman to appear nude in a motion picture (the 1933 Czech film Extase). Her first husband was an armaments 
manufacturer who told her how easy it was to block the radio signals then used to control torpedoes. When she
discovered that he was selling weapons to Hitler, she was horrified, disguised herself as a maid to escape him,
and fled to Hollywood to continue her career as a movie actress. In her spare time, she invented frequency 
hopping to help the Allied war effort. Her scheme used 88 frequencies, the number of keys (and frequencies) on 
the piano. For their invention, she and her friend, the musical composer George Antheil, received U.S. patent 
2,292,387. However, they were unable to convince the U.S. Navy that their invention had any practical use and 
never received any royalties. Only years after the patent expired did it become popular. 


The other form of spread spectrum, direct sequence spread spectrum, which spreads the signal over a wide 
frequency band, is also gaining popularity in the commercial world. In particular, some second-generation mobile 
phones use it, and it will become dominant with the third generation, thanks to its good spectral efficiency, noise 
immunity, and other properties. Some wireless LANs also use it. We will come back to spread spectrum later in 
this chapter. For a fascinating and detailed history of spread spectrum communication, see (Scholtz, 1982). 


For the moment, we will assume that all transmissions use a narrow frequency band. We will now discuss how 
the various parts of the electromagnetic spectrum of Fig. 2-11 are used, starting with radio. 


2.2.2 Radio Transmission 


Radio waves are easy to generate, can travel long distances, and can penetrate buildings easily, so they are 
widely used for communication, both indoors and outdoors. Radio waves also are omnidirectional, meaning that 
they travel in all directions from the source, so the transmitter and receiver do not have to be carefully aligned 
physically. 


Sometimes omnidirectional radio is good, but sometimes it is bad. In the 1970s, General Motors decided to 
equip all its new Cadillacs with computer-controlled antilock brakes. When the driver stepped on the brake pedal, 
the computer pulsed the brakes on and off instead of locking them on hard. One fine day an Ohio Highway 
Patrolman began using his new mobile radio to call headquarters, and suddenly the Cadillac next to him began 
behaving like a bucking bronco. When the officer pulled the car over, the driver claimed that he had done 
nothing and that the car had gone crazy. 


Eventually, a pattern began to emerge: Cadillacs would sometimes go berserk, but only on major highways in 
Ohio and then only when the Highway Patrol was watching. For a long, long time General Motors could not 
understand why Cadillacs worked fine in all the other states and also on minor roads in Ohio. Only after much 
searching did they discover that the Cadillac's wiring made a fine antenna for the frequency used by the Ohio 
Highway Patrol's new radio system. 


The properties of radio waves are frequency dependent. At low frequencies, radio waves pass through obstacles 
well, but the power falls off sharply with distance from the source, roughly as 1/r^2 in air. At high frequencies,
radio waves tend to travel in straight lines and bounce off obstacles. They are also absorbed by rain. At all 
frequencies, radio waves are subject to interference from motors and other electrical equipment. 


Due to radio's ability to travel long distances, interference between users is a problem. For this reason, all 
governments tightly license the use of radio transmitters, with one exception, discussed below. 


In the VLF, LF, and MF bands, radio waves follow the ground, as illustrated in Fig. 2-12(a). These waves can be 
detected for perhaps 1000 km at the lower frequencies, less at the higher ones. AM radio broadcasting uses the 
MF band, which is why the ground waves from Boston AM radio stations cannot be heard easily in New York. 
Radio waves in these bands pass through buildings easily, which is why portable radios work indoors. The main 
problem with using these bands for data communication is their low bandwidth [see Eq. (2-3)]. 
Figure 2-12. (a) In the VLF, LF, and MF bands, radio waves follow the curvature of the earth. (b) In the HF
band, they bounce off the ionosphere.

In the HF and VHF bands, the ground waves tend to be absorbed by the earth. However, the waves that reach
the ionosphere, a layer of charged particles circling the earth at a height of 100 to 500 km, are refracted by it and 
sent back to earth, as shown in Fig. 2-12(b). Under certain atmospheric conditions, the signals can bounce 
several times. Amateur radio operators (hams) use these bands to talk long distance. The military also
communicates in the HF and VHF bands.


2.2.3 Microwave Transmission 


Above 100 MHz, the waves travel in nearly straight lines and can therefore be narrowly focused. Concentrating 
all the energy into a small beam by means of a parabolic antenna (like the familiar satellite TV dish) gives a 
much higher signal-to-noise ratio, but the transmitting and receiving antennas must be accurately aligned with 
each other. In addition, this directionality allows multiple transmitters lined up in a row to communicate with 
multiple receivers in a row without interference, provided some minimum spacing rules are observed. Before 
fiber optics, for decades these microwaves formed the heart of the long-distance telephone transmission system. 
In fact, MCI, one of AT&T's first competitors after it was deregulated, built its entire system with microwave 
communications going from tower to tower tens of kilometers apart. Even the company's name reflected this 
(MCI stood for Microwave Communications, Inc.). MCI has since gone over to fiber and merged with WorldCom. 


Since the microwaves travel in a straight line, if the towers are too far apart, the earth will get in the way (think 
about a San Francisco to Amsterdam link). Consequently, repeaters are needed periodically. The higher the 
towers are, the farther apart they can be. The distance between repeaters goes up very roughly with the square 
root of the tower height. For 100-meter-high towers, repeaters can be spaced 80 km apart. 
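
The square-root relationship follows from simple geometry: a tower of height h sees the horizon at a distance of about sqrt(2Rh), where R is the radius of the earth, and two equal towers can be twice that far apart. The sketch below assumes an effective earth radius of 4/3 of the true radius, a common radio-engineering allowance for atmospheric refraction that is not stated in the text; the constants and function name are illustrative.

```python
import math

EARTH_RADIUS_M = 6.371e6   # mean radius of the earth, meters
K_FACTOR = 4.0 / 3.0       # assumed effective-radius factor for refraction

def max_spacing_km(tower_height_m, k=K_FACTOR):
    """Approximate line-of-sight distance between two equal-height towers."""
    horizon_m = math.sqrt(2 * k * EARTH_RADIUS_M * tower_height_m)
    return 2 * horizon_m / 1000.0

for height in (25, 100, 400):
    print(f"{height:4d} m towers: about {max_spacing_km(height):.0f} km apart")
```

Under these assumptions 100-meter towers come out a little over 80 km apart, in line with the figure above, and quadrupling the height only doubles the spacing.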


Unlike radio waves at lower frequencies, microwaves do not pass through buildings well. In addition, even 
though the beam may be well focused at the transmitter, there is still some divergence in space. Some waves 
may be refracted off low-lying atmospheric layers and may take slightly longer to arrive than the direct waves. 
The delayed waves may arrive out of phase with the direct wave and thus cancel the signal. This effect is called 
multipath fading and is often a serious problem. It is weather and frequency dependent. Some operators keep 10 
percent of their channels idle as spares to switch on when multipath fading wipes out some frequency band 
temporarily. 
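
The cancellation can be illustrated numerically. The sketch below adds a direct wave and one delayed copy of equal strength (an idealization; a real reflected wave is usually weaker) and reports the amplitude of the sum. The function name and the chosen path differences are assumptions made for this example.

```python
import cmath
import math

C = 3.0e8  # approximate speed of light, m/sec

def combined_amplitude(extra_path_m, freq_hz):
    """Amplitude of a unit direct wave plus a unit wave delayed by extra_path_m."""
    wavelength = C / freq_hz
    phase_lag = 2 * math.pi * extra_path_m / wavelength
    return abs(1 + cmath.exp(-1j * phase_lag))

freq = 4e9  # 4 GHz, wavelength 7.5 cm
for extra_cm in (0.0, 3.75, 7.5):
    amp = combined_amplitude(extra_cm / 100.0, freq)
    print(f"extra path {extra_cm:5.2f} cm -> relative amplitude {amp:.2f}")
```

An extra path of half a wavelength (3.75 cm here) puts the two waves exactly out of phase, so they cancel, while a full wavelength makes them reinforce. Because the wavelength depends on frequency, a fade on one channel need not affect another, which is why spare channels help.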


The demand for more and more spectrum drives operators to yet higher frequencies. Bands up to 10 GHz are 
now in routine use, but at about 4 GHz a new problem sets in: absorption by water. These waves are only a few 
centimeters long and are absorbed by rain. This effect would be fine if one were planning to build a huge outdoor 
microwave oven for roasting passing birds, but for communication, it is a severe problem. As with multipath 
fading, the only solution is to shut off links that are being rained on and route around them. 


In summary, microwave communication is so widely used for long-distance telephone communication, mobile 
phones, television distribution, and other uses that a severe shortage of spectrum has developed. It has several 
significant advantages over fiber. The main one is that no right of way is needed, and by buying a small plot of 
ground every 50 km and putting a microwave tower on it, one can bypass the telephone system and 
communicate directly. This is how MCI managed to get started as a new long-distance telephone company so 
quickly. (Sprint went a completely different route: it was formed by the Southern Pacific Railroad, which already 
owned a large amount of right of way and just buried fiber next to the tracks.) 


Microwave is also relatively inexpensive. Putting up two simple towers (maybe just big poles with four guy wires)
and putting antennas on each one may be cheaper than burying 50 km of fiber through a congested urban area 
or up over a mountain, and it may also be cheaper than leasing the telephone company's fiber, especially if the 
telephone company has not yet even fully paid for the copper it ripped out when it put in the fiber. 


The Politics of the Electromagnetic Spectrum 


To prevent total chaos, there are national and international agreements about who gets to use which 
frequencies. Since everyone wants a higher data rate, everyone wants more spectrum. National governments 
allocate spectrum for AM and FM radio, television, and mobile phones, as well as for telephone companies, 
police, maritime, navigation, military, government, and many other competing users. Worldwide, an agency of 
ITU-R (WARC) tries to coordinate this allocation so devices that work in multiple countries can be manufactured. 
However, countries are not bound by ITU-R's recommendations, and the FCC (Federal Communications
Commission), which does the allocation for the United States, has occasionally rejected ITU-R's
recommendations (usually because they required some politically powerful group to give up some piece of the
spectrum).


Even when a piece of spectrum has been allocated to some use, such as mobile phones, there is the additional 
issue of which carrier is allowed to use which frequencies. Three algorithms were widely used in the past. The 
oldest algorithm, often called the beauty contest, requires each carrier to explain why its proposal serves the 
public interest best. Government officials then decide which of the nice stories they enjoy most. Having some 
government official award property worth billions of dollars to his favorite company often leads to bribery, 
corruption, nepotism, and worse. Furthermore, even a scrupulously honest government official who thought that 
a foreign company could do a better job than any of the national companies would have a lot of explaining to do. 


This observation led to algorithm 2, holding a lottery among the interested companies. The problem with that 
idea is that companies with no interest in using the spectrum can enter the lottery. If, say, a fast food restaurant 
or shoe store chain wins, it can resell the spectrum to a carrier at a huge profit and with no risk. 


Bestowing huge windfalls on alert, but otherwise random, companies has been severely criticized by many, 
which led to algorithm 3: auctioning off the bandwidth to the highest bidder. When England auctioned off the 
frequencies needed for third-generation mobile systems in 2000, they expected to get about $4 billion. They 
actually received about $40 billion because the carriers got into a feeding frenzy, scared to death of missing the 
mobile boat. This event switched on nearby governments' greedy bits and inspired them to hold their own 
auctions. It worked, but it also left some of the carriers with so much debt that they are close to bankruptcy. Even 
in the best cases, it will take many years to recoup the licensing fee. 


A completely different approach to allocating frequencies is to not allocate them at all. Just let everyone transmit 
at will but regulate the power used so that stations have such a short range they do not interfere with each other. 
Accordingly, most governments have set aside some frequency bands, called the ISM (Industrial, Scientific, 
Medical) bands for unlicensed usage. Garage door openers, cordless phones, radio-controlled toys, wireless 
mice, and numerous other wireless household devices use the ISM bands. To minimize interference between 
these uncoordinated devices, the FCC mandates that all devices in the ISM bands use spread spectrum 
techniques. Similar rules apply in other countries.


The location of the ISM bands varies somewhat from country to country. In the United States, for example, 
devices whose power is under 1 watt can use the bands shown in Fig. 2-13 without requiring an FCC license. The
900-MHz band works best, but it is crowded and not available worldwide. The 2.4-GHz band is available in most 
countries, but it is subject to interference from microwave ovens and radar installations. Bluetooth and some of 
the 802.11 wireless LANs operate in this band. The 5.7-GHz band is new and relatively undeveloped, so 
equipment for it is expensive, but since 802.11a uses it, it will quickly become more popular. 


Figure 2-13. The ISM bands in the United States. 
2.2.4 Infrared and Millimeter Waves


Unguided infrared and millimeter waves are widely used for short-range communication. The remote controls 
used on televisions, VCRs, and stereos all use infrared communication. They are relatively directional, cheap, 
and easy to build but have a major drawback: they do not pass through solid objects (try standing between your 
remote control and your television and see if it still works). In general, as we go from long-wave radio toward 
visible light, the waves behave more and more like light and less and less like radio. 


On the other hand, the fact that infrared waves do not pass through solid walls well is also a plus. It means that 
an infrared system in one room of a building will not interfere with a similar system in adjacent rooms or 
buildings: you cannot control your neighbor's television with your remote control. Furthermore, security of 
infrared systems against eavesdropping is better than that of radio systems precisely for this reason. Therefore, 
no government license is needed to operate an infrared system, in contrast to radio systems, which must be 
licensed outside the ISM bands. Infrared communication has a limited use on the desktop, for example, 
connecting notebook computers and printers, but it is not a major player in the communication game. 


2.2.5 Lightwave Transmission 


Unguided optical signaling has been in use for centuries. Paul Revere used binary optical signaling from the Old 
North Church just prior to his famous ride. A more modern application is to connect the LANs in two buildings via 
lasers mounted on their rooftops. Coherent optical signaling using lasers is inherently unidirectional, so each 
building needs its own laser and its own photodetector. This scheme offers very high bandwidth and very low 
cost. It is also relatively easy to install and, unlike microwave, does not require an FCC license. 


The laser's strength, a very narrow beam, is also its weakness here. Aiming a laser beam 1-mm wide at a target 
the size of a pin head 500 meters away requires the marksmanship of a latter-day Annie Oakley. Usually, lenses 
are put into the system to defocus the beam slightly. 


A disadvantage is that laser beams cannot penetrate rain or thick fog, but they normally work well on sunny 
days. However, the author once attended a conference at a modern hotel in Europe at which the conference 
organizers thoughtfully provided a room full of terminals for the attendees to read their e-mail during boring 
presentations. Since the local PTT was unwilling to install a large number of telephone lines for just 3 days, the 
organizers put a laser on the roof and aimed it at their university's computer science building a few kilometers 
away. They tested it the night before the conference and it worked perfectly. At 9 a.m. the next morning, on a 
bright sunny day, the link failed completely and stayed down all day. That evening, the organizers tested it again 
very carefully, and once again it worked absolutely perfectly. The pattern repeated itself for two more days 
consistently. 


After the conference, the organizers discovered the problem. Heat from the sun during the daytime caused 
convection currents to rise up from the roof of the building, as shown in Fig. 2-14. This turbulent air diverted the 
beam and made it dance around the detector. Atmospheric "seeing" like this makes the stars twinkle (which is 
why astronomers put their telescopes on the tops of mountains—to get above as much of the atmosphere as 
possible). It is also responsible for shimmering roads on a hot day and the wavy images seen when one looks 
out above a hot radiator. 

Figure 2-14. Convection currents can interfere with laser communication systems. A bidirectional 
system with two lasers is pictured here. 
2.3 The Public Switched Telephone Network


When two computers owned by the same company or organization and located close to each other need to 
communicate, it is often easiest just to run a cable between them. LANs work this way. However, when the 
distances are large or there are many computers or the cables have to pass through a public road or other public 
right of way, the costs of running private cables are usually prohibitive. Furthermore, in just about every country
in the world, stringing private transmission lines across (or underneath) public property is also illegal.
Consequently, the network designers must rely on the existing telecommunication facilities. 


These facilities, especially the PSTN (Public Switched Telephone Network), were usually designed many years 
ago, with a completely different goal in mind: transmitting the human voice in a more-or-less recognizable form. 
Their suitability for use in computer-computer communication is often marginal at best, but the situation is rapidly 
changing with the introduction of fiber optics and digital technology. In any event, the telephone system is so 
tightly intertwined with (wide area) computer networks, that it is worth devoting some time to studying it. 


To see the order of magnitude of the problem, let us make a rough but illustrative comparison of the properties 
of a typical computer-computer connection via a local cable and via a dial-up telephone line. A cable running 
between two computers can transfer data at 10^9 bps, maybe more. In contrast, a dial-up line has a maximum
data rate of 56 kbps, a difference of a factor of almost 20,000. That is the difference between a duck waddling 
leisurely through the grass and a rocket to the moon. If the dial-up line is replaced by an ADSL connection, there 
is still a factor of 1000 to 2000 difference.
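
The comparison is plain division, shown here as a quick check. The variable names are illustrative, and the ADSL rates printed are only those implied by the stated factors, not measured figures.

```python
CABLE_BPS = 1e9     # local cable, bits/sec (from the text)
DIALUP_BPS = 56e3   # dial-up line, bits/sec (from the text)

ratio = CABLE_BPS / DIALUP_BPS
print(f"cable / dial-up: a factor of about {ratio:,.0f}")

# ADSL rates implied by a remaining factor of 1000 to 2000.
for factor in (1000, 2000):
    implied_mbps = CABLE_BPS / factor / 1e6
    print(f"factor {factor}: implied ADSL rate of {implied_mbps:.1f} Mbps")
```

The ratio comes out just under 18,000, which the text rounds to "almost 20,000".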


The trouble, of course, is that computer systems designers are used to working with computer systems and 
when suddenly confronted with another system whose performance (from their point of view) is 3 or 4 orders of 
magnitude worse, they, not surprisingly, devote much time and effort to trying to figure out how to use it
efficiently. In the following sections we will describe the telephone system and show how it works. For additional 
information about the innards of the telephone system see (Bellamy, 2000). 


2.3.1 Structure of the Telephone System 


Soon after Alexander Graham Bell patented the telephone in 1876 (just a few hours ahead of his rival, Elisha 
Gray), there was an enormous demand for his new invention. The initial market was for the sale of telephones, 
which came in pairs. It was up to the customer to string a single wire between them. The electrons returned 
through the earth. If a telephone owner wanted to talk to n other telephone owners, separate wires had to be 
strung to all n houses. Within a year, the cities were covered with wires passing over houses and trees in a wild 
jumble. It became immediately obvious that the model of connecting every telephone to every other telephone, 
as shown in Fig. 2-20(a), was not going to work. 
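
A quick count shows why the fully interconnected model could not scale. With n telephones, a wire between every pair needs n(n - 1)/2 wires, whereas running one wire from each telephone to a single central point needs only n. The function names below are invented for this sketch.

```python
def full_mesh_wires(n):
    """Wires needed to connect every telephone directly to every other one."""
    return n * (n - 1) // 2

def central_switch_wires(n):
    """Wires needed when each telephone has one line to a central office."""
    return n

for n in (4, 100, 1000):
    print(f"{n:5d} telephones: {full_mesh_wires(n):7d} wires fully connected, "
          f"{central_switch_wires(n):5d} with a central switch")
```

The fully connected count grows with the square of the number of telephones, so even a town of 1000 subscribers would need nearly half a million wires.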


Figure 2-20. (a) Fully-interconnected network. (b) Centralized switch. (c) Two-level hierarchy. 


To his credit, Bell saw this and formed the Bell Telephone Company, which opened its first switching office (in 
New Haven, Connecticut) in 1878. The company ran a wire to each customer's house or office. To make a call, 
the customer would crank the phone to make a ringing sound in the telephone company office to attract the 
attention of an operator, who would then manually connect the caller to the callee by using a jumper cable. The 
model of a single switching office is illustrated in Fig. 2-20(b). 


Pretty soon, Bell System switching offices were springing up everywhere and people wanted to make long-distance
calls between cities, so the Bell System began to connect the switching offices. The original problem
soon returned: to connect every switching office to every other switching office by means of a wire between them
quickly became unmanageable, so second-level switching offices were invented. After a while, multiple second-level
offices were needed, as illustrated in Fig. 2-20(c). Eventually, the hierarchy grew to five levels.


By 1890, the three major parts of the telephone system were in place: the switching offices, the wires between 
the customers and the switching offices (by now balanced, insulated, twisted pairs instead of open wires with an 
earth return), and the long-distance connections between the switching offices. While there have been 
improvements in all three areas since then, the basic Bell System model has remained essentially intact for over 
100 years. For a short technical history of the telephone system, see (Hawley, 1991). 


Prior to the 1984 breakup of AT&T, the telephone system was organized as a highly-redundant, multilevel 
hierarchy. The following description is highly simplified but gives the essential flavor nevertheless. Each 
telephone has two copper wires coming out of it that go directly to the telephone company's nearest end office 
(also called a local central office). The distance is typically 1 to 10 km, being shorter in cities than in rural areas. 
In the United States alone there are about 22,000 end offices. The two-wire connections between each 
subscriber's telephone and the end office are known in the trade as the local loop. If the world's local loops were 
stretched out end to end, they would extend to the moon and back 1000 times. 


At one time, 80 percent of AT&T's capital value was the copper in the local loops. AT&T was then, in effect, the 
world's largest copper mine. Fortunately, this fact was not widely known in the investment community. Had it 
been known, some corporate raider might have bought AT&T, terminated all telephone service in the United 
States, ripped out all the wire, and sold the wire to a copper refiner to get a quick payback. 


If a subscriber attached to a given end office calls another subscriber attached to the same end office, the 
switching mechanism within the office sets up a direct electrical connection between the two local loops. This 
connection remains intact for the duration of the call. 


If the called telephone is attached to another end office, a different procedure has to be used. Each end office 
has a number of outgoing lines to one or more nearby switching centers, called toll offices (or if they are within 
the same local area, tandem offices). These lines are called toll connecting trunks. If both the caller's and
callee's end offices happen to have a toll connecting trunk to the same toll office (a likely occurrence if they are 
relatively close by), the connection may be established within the toll office. A telephone network consisting only 
of telephones (the small dots), end offices (the large dots), and toll offices (the squares) is shown in Fig. 2-20(c). 


If the caller and callee do not have a toll office in common, the path will have to be established somewhere 
higher up in the hierarchy. Primary, sectional, and regional offices form a network by which the toll offices are 
connected. The toll, primary, sectional, and regional exchanges communicate with each other via high- 
bandwidth intertoll trunks (also called interoffice trunks). The number of different kinds of switching centers and 
their topology (e.g., can two sectional offices have a direct connection or must they go through a regional 
office?) varies from country to country depending on the country's telephone density. Figure 2-21 shows how a 
medium-distance connection might be routed. 


Figure 2-21. A typical circuit route for a medium-distance call. 
A variety of transmission media are used for telecommunication. Local loops consist of category 3 twisted pairs
nowadays, although in the early days of telephony, uninsulated wires spaced 25 cm apart on telephone poles 
were common. Between switching offices, coaxial cables, microwaves, and especially fiber optics are widely 
used. 

In the past, transmission throughout the telephone system was analog, with the actual voice signal being
transmitted as an electrical voltage from source to destination. With the advent of fiber optics, digital electronics,
and computers, all the trunks and switches are now digital, leaving the local loop as the last piece of analog
technology in the system. Digital transmission is preferred because it is not necessary to accurately reproduce
an analog waveform after it has passed through many amplifiers on a long call. Being able to correctly 
distinguish a 0 from a 1 is enough. This property makes digital transmission more reliable than analog. It is also 
cheaper and easier to maintain. 


In summary, the telephone system consists of three major components: 


1. Local loops (analog twisted pairs going into houses and businesses).
2. Trunks (digital fiber optics connecting the switching offices). 
3. Switching offices (where calls are moved from one trunk to another). 


After a short digression on the politics of telephones, we will come back to each of these three components in 
some detail. The local loops provide everyone access to the whole system, so they are critical. Unfortunately, 
they are also the weakest link in the system. For the long-haul trunks, the main issue is how to collect multiple 
calls together and send them out over the same fiber. This subject is called multiplexing, and we will study three 
different ways to do it. Finally, there are two fundamentally different ways of doing switching; we will look at both. 


2.3.2 The Politics of Telephones 


For decades prior to 1984, the Bell System provided both local and long-distance service throughout most of the
United States. In the 1970s, the U.S. Federal Government came to believe that this was an illegal monopoly and 
sued to break it up. The government won, and on January 1, 1984, AT&T was broken up into AT&T Long Lines, 
23 BOCs (Bell Operating Companies), and a few other pieces. The 23 BOCs were grouped into seven regional 
BOCs (RBOCs) to make them economically viable. The entire nature of telecommunication in the United States 
was changed overnight by court order (not by an act of Congress). 


The exact details of the divestiture were described in the so-called MFJ (Modified Final Judgment, an oxymoron
if ever there was one: if the judgment could be modified, it clearly was not final). This event led to increased
competition, better service, and lower long-distance prices for consumers and businesses. However, prices for
local service rose as the cross subsidies from long-distance calling were eliminated and local service had to
become self-supporting. Many other countries have now introduced competition along similar lines.


To make it clear who could do what, the United States was divided up into 164 LATAs (Local Access and 
Transport Areas). Very roughly, a LATA is about as big as the area covered by one area code. Within a LATA, 
there was one LEC (Local Exchange Carrier) that had a monopoly on traditional telephone service within its 
area. The most important LECs were the BOCs, although some LATAs contained one or more of the 1500 
independent telephone companies operating as LECs. 


All inter-LATA traffic was handled by a different kind of company, an IXC (IntereXchange Carrier). Originally, 
AT&T Long Lines was the only serious IXC, but now WorldCom and Sprint are well-established competitors in 
the IXC business. One of the concerns at the breakup was to ensure that all the IXCs would be treated equally in 
terms of line quality, tariffs, and the number of digits their customers would have to dial to use them. The way 
this is handled is illustrated in Fig. 2-22. Here we see three example LATAs, each with several end offices. 
LATAs 2 and 3 also have a small hierarchy with tandem offices (intra-LATA toll offices).


Figure 2-22. The relationship of LATAs, LECs, and IXCs. All the circles are LEC switching offices. Each
hexagon belongs to the IXC whose number is in it.


Any IXC that wishes to handle calls originating in a LATA can build a switching office called a POP (Point of
Presence) there. The LEC is required to connect each IXC to every end office, either directly, as in LATAs 1 and 
3, or indirectly, as in LATA 2. Furthermore, the terms of the connection, both technical and financial, must be 
identical for all IXCs. In this way, a subscriber in, say, LATA 1, can choose which IXC to use for calling 
subscribers in LATA 3. 


As part of the MFJ, the IXCs were forbidden to offer local telephone service and the LECs were forbidden to 
offer inter-LATA telephone service, although both were free to enter any other business, such as operating fried 
chicken restaurants. In 1984, that was a fairly unambiguous statement. Unfortunately, technology has a funny 
way of making the law obsolete. Neither cable television nor mobile phones were covered by the agreement. As 
cable television went from one way to two way and mobile phones exploded in popularity, both LECs and IXCs 
began buying up or merging with cable and mobile operators. 


By 1995, Congress saw that trying to maintain a distinction between the various kinds of companies was no 
longer tenable and drafted a bill to allow cable TV companies, local telephone companies, long-distance carriers, 
and mobile operators to enter one another's businesses. The idea was that any company could then offer its 
customers a single integrated package containing cable TV, telephone, and information services and that 
different companies would compete on service and price. The bill was enacted into law in February 1996. As a 
result, some BOCs became IXCs and some other companies, such as cable television operators, began offering 
local telephone service in competition with the LECs. 


One interesting property of the 1996 law is the requirement that LECs implement local number portability. This 
means that a customer can change local telephone companies without having to get a new telephone number. 
This provision removes a huge hurdle for many people and makes them much more inclined to switch LECs, 
thus increasing competition. As a result, the U.S. telecommunications landscape is currently undergoing a 
radical restructuring. Again, many other countries are starting to follow suit. Often other countries wait to see 
how this kind of experiment works out in the U.S. If it works well, they do the same thing; if it works badly, they 
try something else.


2.3.3 The Local Loop: Modems, ADSL, and Wireless


It is now time to start our detailed study of how the telephone system works. The main parts of the system are 
illustrated in Fig. 2-23. Here we see the local loops, the trunks, and the toll offices and end offices, both of which 
contain switching equipment that switches calls. An end office has up to 10,000 local loops (in the U.S. and other 
large countries). In fact, until recently, the area code + exchange indicated the end office, so (212) 601-xxxx was 
a specific end office with 10,000 subscribers, numbered 0000 through 9999. With the advent of competition for 
local service, this system was no longer tenable because multiple companies wanted to own the end office code. 
Also, the number of codes was basically used up, so complex mapping schemes had to be introduced. 


Figure 2-23. The use of both analog and digital transmission for a computer to computer call. 
Conversion is done by the modems and codecs.


Let us begin with the part that most people are familiar with: the two-wire local loop coming from a telephone
company end office into houses and small businesses. The local loop is also frequently referred to as the "last 
mile," although the length can be up to several miles. It has used analog signaling for over 100 years and is 
likely to continue doing so for some years to come, due to the high cost of converting to digital. Nevertheless, 
even in this last bastion of analog transmission, change is taking place. In this section we will study the 
traditional local loop and the new developments taking place here, with particular emphasis on data 
communication from home computers. 


When a computer wishes to send digital data over an analog dial-up line, the data must first be converted to 
analog form for transmission over the local loop. This conversion is done by a device called a modem, 
something we will study shortly. At the telephone company end office the data are converted to digital form for 
transmission over the long-haul trunks. 


If the other end is a computer with a modem, the reverse conversion—digital to analog—is needed to traverse 
the local loop at the destination. This arrangement is shown in Fig. 2-23 for ISP 1 (Internet Service Provider), 
which has a bank of modems, each connected to a different local loop. This ISP can handle as many 
connections as it has modems (assuming its server or servers have enough computing power). This 
arrangement was the normal one until 56-kbps modems appeared, for reasons that will become apparent 
shortly. 


Analog signaling consists of varying a voltage with time to represent an information stream. If transmission 
media were perfect, the receiver would receive exactly the same signal that the transmitter sent. Unfortunately, 
media are not perfect, so the received signal is not the same as the transmitted signal. For digital data, this 
difference can lead to errors. 


Transmission lines suffer from three major problems: attenuation, delay distortion, and noise. Attenuation is the
loss of energy as the signal propagates outward. The loss is expressed in decibels per kilometer. The amount of
energy lost depends on the frequency. To see the effect of this frequency dependence, imagine a signal not as a
simple waveform, but as a series of Fourier components. Each component is attenuated by a different amount,
which results in a different Fourier spectrum at the receiver.
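
To make the frequency dependence concrete, the short Python sketch below converts a loss quoted in decibels per kilometer into the fraction of signal power that survives a run of wire. The two loss figures and the 2-km length are hypothetical numbers chosen only for illustration, not measurements of any real cable; only the decibel definition itself is standard.

```python
def surviving_power_fraction(loss_db_per_km, length_km):
    """Fraction of the transmitted power left after length_km of wire."""
    total_loss_db = loss_db_per_km * length_km
    return 10 ** (-total_loss_db / 10)

# Hypothetical losses for two Fourier components of the same signal.
low_component = surviving_power_fraction(loss_db_per_km=3.0, length_km=2.0)
high_component = surviving_power_fraction(loss_db_per_km=6.0, length_km=2.0)

print(f"low-frequency component keeps  {low_component:.3f} of its power")
print(f"high-frequency component keeps {high_component:.3f} of its power")
```

Because the two components shrink by different factors, their ratio at the receiver differs from their ratio at the transmitter, which is the change in the Fourier spectrum described above.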


To make things worse, the different Fourier components also propagate at different speeds in the wire. This 
speed difference leads to distortion of the signal received at the other end. 


Another problem is noise, which is unwanted energy from sources other than the transmitter. Thermal noise is
caused by the random motion of the electrons in a wire and is unavoidable. Crosstalk is caused by inductive
coupling between two wires that are close to each other. Sometimes when talking on the telephone, you can
hear another conversation in the background. That is crosstalk. Finally, there is impulse noise, caused by spikes
on the power line or other causes. For digital data, impulse noise can wipe out one or more bits.


Modems 


Due to the problems just discussed, especially the fact that both attenuation and propagation speed are 
frequency dependent, it is undesirable to have a wide range of frequencies in the signal. Unfortunately, the 
square waves used in digital signals have a wide frequency spectrum and thus are subject to strong attenuation 
and delay distortion. These effects make baseband (DC) signaling unsuitable except at slow speeds and over 
short distances. 


To get around the problems associated with DC signaling, especially on telephone lines, AC signaling is used. A 
continuous tone in the 1000 to 2000-Hz range, called a sine wave carrier, is introduced. Its amplitude, frequency, 
or phase can be modulated to transmit information. In amplitude modulation, two different amplitudes are used 
to represent 0 and 1, respectively. In frequency modulation, also known as frequency shift keying, two (or more) 
different tones are used. (The term keying is also widely used in the industry as a synonym for modulation.) In 
the simplest form of phase modulation, the carrier wave is systematically shifted 0 or 180 degrees at uniformly 
spaced intervals. A better scheme is to use shifts of 45, 135, 225, or 315 degrees to transmit 2 bits of information 
per time interval. Also, always requiring a phase shift at the end of every time interval makes it easier for the
receiver to recognize the boundaries of the time intervals.
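
The three basic schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name modulate, the 1000-Hz carrier, the 1-msec bit time, and the eight samples per bit are all our own choices, the frequency case simply doubles the tone for a 1 bit, and the phase case uses an absolute phase of 0 or 180 degrees per bit rather than any particular modem standard.

```python
import math

def modulate(bits, scheme, carrier_hz=1000.0, bit_time=0.001, samples_per_bit=8):
    """Return carrier samples encoding a string of '0'/'1' characters.

    scheme is 'amplitude', 'frequency', or 'phase'.
    """
    if scheme not in ("amplitude", "frequency", "phase"):
        raise ValueError(f"unknown scheme: {scheme}")
    samples = []
    for i, bit in enumerate(bits):
        one = bit == "1"
        amplitude, freq, phase = 1.0, carrier_hz, 0.0
        if scheme == "amplitude":
            amplitude = 1.0 if one else 0.0                # carrier on or off
        elif scheme == "frequency":
            freq = 2 * carrier_hz if one else carrier_hz   # two different tones
        else:
            phase = math.pi if one else 0.0                # 0 or 180 degrees
        for k in range(samples_per_bit):
            t = (i + k / samples_per_bit) * bit_time       # time of this sample
            samples.append(amplitude * math.sin(2 * math.pi * freq * t + phase))
    return samples

for scheme in ("amplitude", "frequency", "phase"):
    wave = modulate("0110", scheme)
    print(scheme, [round(s, 2) for s in wave[:8]])
```

In the phase case, the samples for a 1 bit are the negatives of the samples for a 0 bit, which is what a 180-degree shift of a sine wave amounts to.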


Figure 2-24 illustrates the three forms of modulation. In Fig. 2-24(b) one of the amplitudes is nonzero and one is
zero. In Fig. 2-24(c) two frequencies are used. In Fig. 2-24(d) a phase shift is either present or absent at each bit
boundary. A device that accepts a serial stream of bits as input and produces a carrier modulated by one (or 
more) of these methods (or vice versa) is called a modem (for modulator-demodulator). The modem is inserted 
between the (digital) computer and the (analog) telephone system. 


Figure 2-24. (a) A binary signal. (b) Amplitude modulation. (c) Frequency modulation. (d) Phase 
modulation.


To go to higher and higher speeds, it is not possible to just keep increasing the sampling rate. The Nyquist
theorem says that even with a perfect 3000-Hz line (which a dial-up telephone is decidedly not), there is no point 
in sampling faster than 6000 Hz. In practice, most modems sample 2400 times/sec and focus on getting more 
bits per sample. 


The number of samples per second is measured in baud. During each baud, one symbol is sent. Thus, an n-baud
line transmits n symbols/sec. For example, a 2400-baud line sends one symbol about every 416.667 µsec.
If the symbol consists of 0 volts for a logical 0 and 1 volt for a logical 1, the bit rate is 2400 bps. If, however, the
voltages 0, 1, 2, and 3 volts are used, every symbol consists of 2 bits, so a 2400-baud line can transmit 2400 
symbols/sec at a data rate of 4800 bps. Similarly, with four possible phase shifts, there are also 2 bits/symbol, so 
again here the bit rate is twice the baud rate. The latter technique is widely used and called QPSK (Quadrature 
Phase Shift Keying). 


The concepts of bandwidth, baud, symbol, and bit rate are commonly confused, so let us restate them here. The 
bandwidth of a medium is the range of frequencies that pass through it with minimum attenuation. It is a physical 
property of the medium (usually from 0 to some maximum frequency) and measured in Hz. The baud rate is the 
number of samples/sec made. Each sample sends one piece of information, that is, one symbol. The baud rate 
and symbol rate are thus the same. The modulation technique (e.g., QPSK) determines the number of 
bits/symbol. The bit rate is the amount of information sent over the channel and is equal to the number of 
symbols/sec times the number of bits/symbol. 
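
The relationship between these quantities is simple enough to check with a few lines of Python. The function names below are our own; the numbers are the 2400-baud examples used in the text.

```python
import math

def symbol_time_usec(baud):
    """Duration of one symbol, in microseconds."""
    return 1_000_000 / baud

def bit_rate_bps(baud, levels):
    """Bit rate when each symbol is one of `levels` distinct values."""
    bits_per_symbol = math.log2(levels)
    return baud * bits_per_symbol

print(round(symbol_time_usec(2400), 3))   # symbol time on a 2400-baud line
print(bit_rate_bps(2400, levels=2))       # two voltage levels
print(bit_rate_bps(2400, levels=4))       # four voltage levels, or QPSK
```

Note that the baud rate is the same in all three calls; only the number of distinguishable values per symbol changes the bit rate.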


All advanced modems use a combination of modulation techniques to transmit multiple bits per baud. Often 
multiple amplitudes and multiple phase shifts are combined to transmit several bits/symbol. In Fig. 2-25(a), we 
see dots at 45, 135, 225, and 315 degrees with constant amplitude (distance from the origin). The phase of a dot 
is indicated by the angle a line from it to the origin makes with the positive x-axis. Fig. 2-25(a) has four valid 
combinations and can be used to transmit 2 bits per symbol. It is QPSK.


Figure 2-25. (a) QPSK. (b) QAM-16. (c) QAM-64.


In Fig. 2-25(b) we see a different modulation scheme, in which four amplitudes and four phases are used, for a
total of 16 different combinations. This modulation scheme can be used to transmit 4 bits per symbol. It is called 
QAM-16 (Quadrature Amplitude Modulation). Sometimes the term 16-QAM is used instead. QAM-16 can be 
used, for example, to transmit 9600 bps over a 2400-baud line. 
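
As a rough sketch of how such constellations can be enumerated, the Python fragment below builds every combination of a set of amplitudes and a set of phases, following the description given above rather than the exact layout of any modem standard. The amplitude values 1.0 through 4.0 and the helper names are assumptions made purely for illustration.

```python
import cmath
import math

def constellation(amplitudes, phases_deg):
    """All (amplitude, phase) combinations, as points in the complex plane."""
    return [cmath.rect(a, math.radians(p)) for a in amplitudes for p in phases_deg]

def bits_per_symbol(points):
    """Number of bits one symbol can carry for a given constellation."""
    return int(math.log2(len(points)))

PHASES = [45, 135, 225, 315]
qpsk = constellation([1.0], PHASES)                    # one amplitude, four phases
qam16 = constellation([1.0, 2.0, 3.0, 4.0], PHASES)    # four amplitudes, four phases

for name, points in (("QPSK", qpsk), ("QAM-16", qam16)):
    bits = bits_per_symbol(points)
    print(name, len(points), "points,", bits, "bits/symbol,", 2400 * bits, "bps at 2400 baud")
```

Each additional bit per symbol doubles the number of points that the receiver must be able to tell apart.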


Figure 2-25(c) is yet another modulation scheme involving amplitude and phase. It allows 64 different 
combinations, so 6 bits can be transmitted per symbol. It is called QAM-64. Higher-order QAMs also are used. 


Diagrams such as those of Fig. 2-25, which show the legal combinations of amplitude and phase, are called 
constellation diagrams. Each high-speed modem standard has its own constellation pattern and can talk only to 
other modems that use the same one (although most modems can emulate all the slower ones). 


With many points in the constellation pattern, even a small amount of noise in the detected amplitude or phase
can result in an error and, potentially, many bad bits. To reduce the chance of an error, standards for the
higher-speed modems do error correction by adding extra bits to each sample. The schemes are known as TCM
(Trellis Coded Modulation). Thus, for example, the V.32 modem standard uses 32 constellation points to transmit
4 data bits and 1 parity bit per symbol at 2400 baud to achieve 9600 bps with error correction. Its constellation
pattern is shown in Fig. 2-26(a). The decision to "rotate" around the origin by 45 degrees was made for
engineering reasons; the rotated and unrotated constellations have the same information capacity.


Figure 2-26. (a) V.32 for 9600 bps. (b) V.32 bis for 14,400 bps.


The next step above 9600 bps is 14,400 bps. It is called V.32 bis. This speed is achieved by transmitting 6 data
bits and 1 parity bit per sample at 2400 baud. Its constellation pattern has 128 points when QAM-128 is used 
and is shown in Fig. 2-26(b). Fax modems use this speed to transmit pages that have been scanned in as bit 
maps. QAM-256 is not used in any standard telephone modems, but it is used on cable networks, as we shall 
see.


The next telephone modem after V.32 bis is V.34, which runs at 28,800 bps at 2400 baud with 12 data
bits/symbol. The final modem in this series is V.34 bis which uses 14 data bits/symbol at 2400 baud to achieve 
33,600 bps. 
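
The speeds of this whole family follow from the same multiplication, as the small Python table below shows. The data-bit counts are the ones quoted in the text; the dictionary and variable names are our own.

```python
BAUD = 2400  # symbols/sec used by all four standards

data_bits_per_symbol = {
    "V.32": 4,
    "V.32 bis": 6,
    "V.34": 12,
    "V.34 bis": 14,
}

rates_bps = {name: bits * BAUD for name, bits in data_bits_per_symbol.items()}

for name, rate in rates_bps.items():
    print(f"{name:9s} {data_bits_per_symbol[name]:2d} data bits/symbol -> {rate} bps")

# With one parity bit added per symbol, the constellation needs
# 2**(data bits + 1) points: 32 for V.32 and 128 for V.32 bis.
v32_points = 2 ** (data_bits_per_symbol["V.32"] + 1)
v32bis_points = 2 ** (data_bits_per_symbol["V.32 bis"] + 1)
```

The parity bit enlarges the constellation without adding to the user data rate, which is why V.32 needs 32 points to deliver 4 data bits per symbol.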


To increase the effective data rate further, many modems compress the data before transmitting it, to get an 
effective data rate higher than 33,600 bps. On the other hand, nearly all modems test the line before starting to 
transmit user data, and if they find the quality lacking, cut back to a speed lower than the rated maximum. Thus, 
the effective modem speed observed by the user can be lower, equal to, or higher than the official rating. 


All modern modems allow traffic in both directions at the same time (by using different frequencies for different 
directions). A connection that allows traffic in both directions simultaneously is called full duplex. A two-lane road 
is full duplex. A connection that allows traffic either way, but only one way at a time is called half duplex. A single 
railroad track is half duplex. A connection that allows traffic only one way is called simplex. A one-way street is 
simplex. Another example of a simplex connection is an optical fiber with a laser on one end and a light detector 
on the other end. 


The reason that standard modems stop at 33,600 is that the Shannon limit for the telephone system is about 35 
kbps, so going faster than this would violate the laws of physics (department of thermodynamics). To find out 
whether 56-kbps modems are theoretically possible, stay tuned. 


But why is the theoretical limit 35 kbps? It is determined by the average length of the local loops and the quality
of these lines. In Fig. 2-23, a call originating at
the computer on the left and terminating at ISP 1 goes over two local loops as an analog signal, once at the
source and once at the destination. Each of these adds noise to the signal. If we could get rid of one of these
local loops, the maximum rate would be doubled.
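
Shannon's formula, C = B log2(1 + S/N), ties this limit to the noise on the line. The Python sketch below is an illustration under assumptions: it takes the 3100-Hz passband of the end office filter as the bandwidth B and solves for the signal-to-noise ratio that would give about 35 kbps. The text does not state a signal-to-noise ratio, so the value computed here is only what the formula implies, not a measured property of real local loops.

```python
import math

def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Maximum data rate of a noisy channel, by Shannon's formula."""
    snr = 10 ** (snr_db / 10)              # convert dB to a power ratio
    return bandwidth_hz * math.log2(1 + snr)

def snr_db_needed(bandwidth_hz, capacity_bps):
    """Signal-to-noise ratio (in dB) required to reach a given capacity."""
    snr = 2 ** (capacity_bps / bandwidth_hz) - 1
    return 10 * math.log10(snr)

needed = snr_db_needed(3100, 35_000)
print(f"S/N needed for 35 kbps over 3100 Hz: {needed:.1f} dB")
print(f"capacity at that S/N: {shannon_capacity_bps(3100, needed):.0f} bps")
```

Under these assumptions the required ratio comes out at about 34 dB. Any extra noise added along the path lowers S/N and hence the achievable rate.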


ISP 2 does precisely that. It has a pure digital feed from the nearest end office. The digital signal used on the 
trunks is fed directly to ISP 2, eliminating the codecs, modems, and analog transmission on its end. Thus, when 
one end of the connection is purely digital, as it is with most ISPs now, the maximum data rate can be as high as 
70 kbps. Between two home users with modems and analog lines, the maximum is 33.6 kbps. 


The reason that 56-kbps modems are in use has to do with the Nyquist theorem. The telephone channel is about
4000 Hz wide (including the guard bands). The maximum number of independent samples per second is thus 
8000. The number of bits per sample in the U.S. is 8, one of which is used for control purposes, allowing 56,000 
bit/sec of user data. In Europe, all 8 bits are available to users, so 64,000-bit/sec modems could have been 
used, but to get international agreement on a standard, 56,000 was chosen. 
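
The arithmetic behind the 56-kbps figure can be written out as follows. The function names are our own; the 4000-Hz channel width, the 8 bits per sample, and the single control bit are the values given above.

```python
def nyquist_samples_per_sec(bandwidth_hz):
    """Maximum number of independent samples per second (Nyquist)."""
    return 2 * bandwidth_hz

def user_data_rate_bps(bandwidth_hz, bits_per_sample, control_bits=0):
    """User data rate when some bits of every sample are reserved for control."""
    user_bits = bits_per_sample - control_bits
    return nyquist_samples_per_sec(bandwidth_hz) * user_bits

us_rate = user_data_rate_bps(4000, bits_per_sample=8, control_bits=1)
europe_rate = user_data_rate_bps(4000, bits_per_sample=8)
print(us_rate, europe_rate)
```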


This modem standard is called V.90. It provides for a 33.6-kbps upstream channel (user to ISP), but a 56-kbps
downstream channel (ISP to user) because there is usually more data transport from the ISP to the user than the 
other way (e.g., requesting a Web page takes only a few bytes, but the actual page could be megabytes). In 
theory, an upstream channel wider than 33.6 kbps would have been possible, but since many local loops are too 
noisy for even 33.6 kbps, it was decided to allocate more of the bandwidth to the downstream channel to 
increase the chances of it actually working at 56 kbps. 


The next step beyond V.90 is V.92. These modems are capable of 48 kbps on the upstream channel if the line 
can handle it. They also determine the appropriate speed to use in about half of the usual 30 seconds required 
by older modems. Finally, they allow an incoming telephone call to interrupt an Internet session, provided that 
the line has call waiting service. 


Digital Subscriber Lines 


When the telephone industry finally got to 56 kbps, it patted itself on the back for a job well done. Meanwhile, the 
cable TV industry was offering speeds up to 10 Mbps on shared cables, and satellite companies were planning 
to offer upward of 50 Mbps. As Internet access became an increasingly important part of their business, the 
telephone companies (LECs) began to realize they needed a more competitive product. Their answer was to 
start offering new digital services over the local loop. Services with more bandwidth than standard telephone 
service are sometimes called broadband, although the term really is more of a marketing concept than a specific 
technical concept.


Initially, there were many overlapping offerings, all under the general name of xDSL (Digital Subscriber Line), for
various x. Below we will discuss these but primarily focus on what is probably going to become the most popular 
of these services, ADSL (Asymmetric DSL). Since ADSL is still being developed and not all the standards are 
fully in place, some of the details given below may change in time, but the basic picture should remain valid. For 
more information about ADSL, see (Summers, 1999; and Vetter et al., 2000). 


The reason that modems are so slow is that telephones were invented for carrying the human voice and the 
entire system has been carefully optimized for this purpose. Data have always been stepchildren. At the point 
where each local loop terminates in the end office, the wire runs through a filter that attenuates all frequencies 
below 300 Hz and above 3400 Hz. The cutoff is not sharp—300 Hz and 3400 Hz are the 3 dB points—so the 
bandwidth is usually quoted as 4000 Hz even though the distance between the 3 dB points is 3100 Hz. Data are 
thus also restricted to this narrow band. 


The trick that makes xDSL work is that when a customer subscribes to it, the incoming line is connected to a 
different kind of switch, one that does not have this filter, thus making the entire capacity of the local loop 
available. The limiting factor then becomes the physics of the local loop, not the artificial 3100 Hz bandwidth 
created by the filter. 


Unfortunately, the capacity of the local loop depends on several factors, including its length, thickness, and 
general quality. A plot of the potential bandwidth as a function of distance is given in Fig. 2-27. This figure 
assumes that all the other factors are optimal (new wires, modest bundles, etc.). 


Figure 2-27. Bandwidth versus distance over category 3 UTP for DSL.


The implication of this figure creates a problem for the telephone company. When it picks a speed to offer, it is
simultaneously picking a radius from its end offices beyond which the service cannot be offered. This means that 
when distant customers try to sign up for the service, they may be told "Thanks a lot for your interest, but you 
live 100 meters too far from the nearest end office to get the service. Could you please move?" The lower the 
chosen speed, the larger the radius and the more customers covered. But the lower the speed, the less 
attractive the service and the fewer the people who will be willing to pay for it. This is where business meets 
technology. (One potential solution is building mini end offices out in the neighborhoods, but that is an expensive 
proposition.) 


The xDSL services have all been designed with certain goals in mind. First, the services must work over the 
existing category 3 twisted pair local loops. Second, they must not affect customers' existing telephones and fax 
machines. Third, they must be much faster than 56 kbps. Fourth, they should be always on, with just a monthly 
charge but no per-minute charge. 


The initial ADSL offering was from AT&T and worked by dividing the spectrum available on the local loop, which
is about 1.1 MHz, into three frequency bands: POTS (Plain Old Telephone Service), upstream (user to end office),
and downstream (end office to user). The technique of having multiple frequency bands is called frequency
division multiplexing; we will study it in detail in a later section. Subsequent offerings from other providers have
taken a different approach, and it appears this one is likely to win out, so we will describe it below.


The alternative approach, called DMT (Discrete MultiTone), is illustrated in Fig. 2-28. In effect, what it does is
divide the available 1.1 MHz spectrum on the local loop into 256 independent channels of 4312.5 Hz each. 
Channel 0 is used for POTS. Channels 1-5 are not used, to keep the voice signal and data signals from
interfering with each other. Of the remaining 250 channels, one is used for upstream control and one is used for 
downstream control. The rest are available for user data. 
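
The channel bookkeeping just described can be checked with a few lines of Python. The variable names are our own; the counts and the channel width come directly from the text.

```python
TOTAL_CHANNELS = 256
CHANNEL_WIDTH_HZ = 4312.5

pots_channels = 1       # channel 0 carries ordinary voice
guard_channels = 5      # channels 1-5 are left unused
control_channels = 2    # one for upstream control, one for downstream control

total_spectrum_hz = TOTAL_CHANNELS * CHANNEL_WIDTH_HZ
remaining_channels = TOTAL_CHANNELS - pots_channels - guard_channels
user_data_channels = remaining_channels - control_channels

print(f"spectrum covered: {total_spectrum_hz / 1e6:.3f} MHz")
print(f"channels after POTS and guard band: {remaining_channels}")
print(f"channels left for user data: {user_data_channels}")
```

The 256 channels together span 1.104 MHz, which is the "about 1.1 MHz" available on the local loop.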


Figure 2-28. Operation of ADSL using discrete multitone modulation.


In principle, each of the remaining channels can be used for a full-duplex data stream, but harmonics, crosstalk,
and other effects keep practical systems well below the theoretical limit. It is up to the provider to determine how
many channels are used for upstream and how many for downstream. A 50-50 mix of upstream and
downstream is technically possible, but most providers allocate something like 80%-90% of the bandwidth to the
downstream channel since most users download more data than they upload. This choice gives rise to the "A" in
ADSL. A common split is 32 channels for upstream and the rest downstream. It is also possible to have a few of
the highest upstream channels be bidirectional for increased bandwidth, although making this optimization
requires adding a special circuit to cancel echoes.


The ADSL standard (ANSI T1.413 and ITU G.992.1) allows speeds of as much as 8 Mbps downstream and 1 
Mbps upstream. However, few providers offer this speed. Typically, providers offer 512 kbps downstream and 64 
kbps upstream (standard service) and 1 Mbps downstream and 256 kbps upstream (premium service). 


Within each channel, a modulation scheme similar to V.34 is used, although the sampling rate is 4000 baud 
instead of 2400 baud. The line quality in each channel is constantly monitored and the data rate adjusted 
continuously as needed, so different channels may have different data rates. The actual data are sent with QAM 
modulation, with up to 15 bits per baud, using a constellation diagram analogous to that of Fig. 2-25(b). With, for 
example, 224 downstream channels and 15 bits/baud at 4000 baud, the downstream bandwidth is 13.44 Mbps. 
In practice, the signal-to-noise ratio is never good enough to achieve this rate, but 8 Mbps is possible on short 
runs over high-quality loops, which is why the standard goes up this far. 
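
Because each channel is monitored separately, the aggregate rate is just the sum over channels of bits per baud times the baud rate. The Python sketch below reproduces the 13.44-Mbps best case and then, purely as a hypothetical illustration, a degraded line on which many channels carry fewer bits. The function name and the degraded bit loading are our own assumptions.

```python
def dmt_rate_bps(bits_per_channel, baud=4000):
    """Aggregate rate when channel i carries bits_per_channel[i] bits per baud."""
    return sum(bits_per_channel) * baud

# Best case from the text: 224 downstream channels, all at 15 bits/baud.
best_case = dmt_rate_bps([15] * 224)

# Hypothetical noisy loop: only some channels sustain the full 15 bits.
degraded = dmt_rate_bps([15] * 100 + [8] * 100 + [2] * 24)

print(f"best case: {best_case / 1e6:.2f} Mbps")
print(f"degraded:  {degraded / 1e6:.2f} Mbps")
```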


A typical ADSL arrangement is shown in Fig. 2-29. In this scheme, a telephone company technician must install 
a NID (Network Interface Device) on the customer's premises. This small plastic box marks the end of the 
telephone company's property and the start of the customer's property. Close to the NID (or sometimes 
combined with it) is a splitter, an analog filter that separates the 0-4000 Hz band used by POTS from the data. 
The POTS signal is routed to the existing telephone or fax machine, and the data signal is routed to an ADSL 
modem. The ADSL modem is actually a digital signal processor that has been set up to act as 250 QAM 
modems operating in parallel at different frequencies. Since most current ADSL modems are external, the
computer must be connected to the modem at high speed. Usually, this is done by putting an Ethernet card in the
computer and operating a very short two-node Ethernet containing only the computer and ADSL modem. 
Occasionally the USB port is used instead of Ethernet. In the future, internal ADSL modem cards will no doubt 
become available.


Figure 2-29. A typical ADSL equipment configuration.


At the other end of the wire, on the end office side, a corresponding splitter is installed. Here the voice portion of
the signal is filtered out and sent to the normal voice switch. The signal above 26 kHz is routed to a new kind of 
device called a DSLAM (Digital Subscriber Line Access Multiplexer), which contains the same kind of digital 
signal processor as the ADSL modem. Once the digital signal has been recovered into a bit stream, packets are 
formed and sent off to the ISP. 


This complete separation between the voice system and ADSL makes it relatively easy for a telephone company
to deploy ADSL. All that is needed is buying a DSLAM and splitter and attaching the ADSL subscribers to the
splitter. Other high-bandwidth services (e.g., ISDN) require much greater changes to the existing switching
equipment.


One disadvantage of the design of Fig. 2-29 is the presence of the NID and splitter on the customer premises. 
Installing these can only be done by a telephone company technician, necessitating an expensive "truck roll" 
(i.e., sending a technician to the customer's premises). Therefore, an alternative splitterless design has also 
been standardized. It is informally called G.lite but the ITU standard number is G.992.2. It is the same as Fig. 2-29
but without the splitter. The existing telephone line is used as is. The only difference is that a microfilter has to
be inserted into each telephone jack between the telephone or ADSL modem and the wire. The microfilter for the
telephone is a low-pass filter eliminating frequencies above 3400 Hz; the microfilter for the ADSL modem is a
high-pass filter eliminating frequencies below 26 kHz. However, this system is not as reliable as having a splitter,
so G.lite can be used only up to 1.5 Mbps (versus 8 Mbps for ADSL with a splitter). G.lite still requires a splitter
in the end office, but that installation does not require thousands of truck rolls.


ADSL is just a physical layer standard. What runs on top of it depends on the carrier. Often the choice is ATM 
due to ATM's ability to manage quality of service and the fact that many telephone companies run ATM in the 
core network. 


Wireless Local Loops 


Since 1996 in the U.S. and a bit later in other countries, companies that wish to compete with the entrenched 
local telephone company (the former monopolist), called an ILEC (Incumbent LEC), are free to do so. The most 
likely candidates are long-distance telephone companies (IXCs). Any IXC wishing to get into the local phone 
business in some city must do the following things. First, it must buy or lease a building for its first end office in 
that city. Second, it must fill the end office with telephone switches and other equipment, all of which are 
available as off-the-shelf products from various vendors. Third, it must run a fiber between the end office and its 
nearest toll office so the new local customers will have access to its national network. Fourth, it must acquire 
customers, typically by advertising better service or lower prices than those of the ILEC. 


Then the hard part begins. Suppose that some customers actually show up. How is the new local phone 
company, called a CLEC (Competitive LEC), going to connect customer telephones and computers to its shiny 
new end office? Buying the necessary rights of way and stringing wires or fibers is prohibitively expensive. Many 
CLECs have discovered a cheaper alternative to the traditional twisted-pair local loop: the WLL (Wireless Local 
Loop). 


In a certain sense, a fixed telephone using a wireless local loop is a bit like a mobile phone, but there are three 
crucial technical differences. First, the wireless local loop customer often wants high-speed Internet connectivity, 
often at speeds at least equal to ADSL. Second, the new customer probably does not mind having a CLEC 
technician install a large directional antenna on his roof pointed at the CLEC's end office. Third, the user does 
not move, eliminating all the problems with mobility and cell handoff that we will study later in this chapter. And 
thus a new industry is born: fixed wireless (local telephone and Internet service run by CLECs over wireless local 
loops). 


Although WLLs began serious operation in 1998, we first have to go back to 1969 to see the origin. In that year 
the FCC allocated two television channels (at 6 MHz each) for instructional television at 2.1 GHz. In subsequent 
years, 31 more channels were added at 2.5 GHz for a total of 198 MHz. 


Instructional television never took off and in 1998, the FCC took the frequencies back and allocated them to 
two-way radio. They were immediately seized upon for wireless local loops. At these frequencies, the microwaves 
are 10-12 cm long. They have a range of about 50 km and can penetrate vegetation and rain moderately well. 
The 198 MHz of new spectrum was immediately put to use for wireless local loops as a service called MMDS 
(Multichannel Multipoint Distribution Service). MMDS can be regarded as a MAN (Metropolitan Area Network), 
as can its cousin LMDS (discussed below). 


The big advantage of this service is that the technology is well established and the equipment is readily 
available. The disadvantage is that the total bandwidth available is modest and must be shared by many users 
over a fairly large geographic area. 


The low bandwidth of MMDS led to interest in millimeter waves as an alternative. At frequencies of 28-31 GHz in 
the U.S. and 40 GHz in Europe, no frequencies were allocated because it is difficult to build silicon integrated 
circuits that operate so fast. That problem was solved with the invention of gallium arsenide integrated circuits, 
opening up millimeter bands for radio communication. The FCC responded to the demand by allocating 1.3 GHz 
to a new wireless local loop service called LMDS (Local Multipoint Distribution Service). This allocation is the 
single largest chunk of bandwidth ever allocated by the FCC for any one use. A similar chunk is being allocated 
in Europe, but at 40 GHz. 


The operation of LMDS is shown in Fig. 2-30. Here a tower is shown with multiple antennas on it, each pointing 
in a different direction. Since millimeter waves are highly directional, each antenna defines a sector, independent 
of the other ones. At this frequency, the range is 2-5 km, which means that many towers are needed to cover a 
city. 


Figure 2-30. Architecture of an LMDS system. 


Like ADSL, LMDS uses an asymmetric bandwidth allocation favoring the downstream channel. With current 
technology, each sector can have 36 Mbps downstream and 1 Mbps upstream, shared among all the users in 
that sector. If each active user downloads three 5-KB pages per minute, the user is occupying an average of 
2000 bps of spectrum, which allows a maximum of 18,000 active users per sector. To keep the delay 
reasonable, no more than 9000 active users should be supported, though. With four sectors, as shown in 
Fig. 2-30, an active user population of 36,000 could be supported. Assuming that one in three customers is on line 
during peak periods, a single tower with four antennas could serve 100,000 people within a 5-km radius of the 
tower. These calculations have been done by many potential CLECs, some of whom have concluded that for a 
modest investment in millimeter-wave towers, they can get into the local telephone and Internet business and 
offer users data rates comparable to cable TV and at a lower price. 
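

The capacity estimate above can be retraced step by step. The following is a minimal sketch, assuming a downstream capacity of 36 Mbps per sector (the figure consistent with the 18,000-user result) and taking 1 KB as 1000 bytes; the variable names are illustrative only.

```python
# Figures from the paragraph above; names are illustrative.
DOWNSTREAM_BPS = 36_000_000   # assumed per-sector downstream capacity
PAGE_BYTES = 5_000            # one 5-KB page, taking 1 KB = 1000 bytes
PAGES_PER_MINUTE = 3
SECTORS = 4

# Average load of one active user: 3 pages * 5000 bytes * 8 bits / 60 sec.
per_user_bps = PAGES_PER_MINUTE * PAGE_BYTES * 8 / 60

# Upper bound on active users if the channel were perfectly shared.
max_users_per_sector = DOWNSTREAM_BPS / per_user_bps

# Halved to keep the delay reasonable, as the text suggests.
supported_per_sector = max_users_per_sector / 2

active_users_per_tower = supported_per_sector * SECTORS

# One customer in three is on line during peak periods.
customers_per_tower = active_users_per_tower * 3

print(per_user_bps, max_users_per_sector, active_users_per_tower, customers_per_tower)
```

The last value comes out at 108,000, which the text rounds to 100,000 people per tower.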


LMDS has a few problems, however. For one thing, millimeter waves propagate in straight lines, so there must 
be a clear line of sight between the roof top antennas and the tower. For another, leaves absorb these waves 
well, so the tower must be high enough to avoid having trees in the line of sight. And what may have looked like 
a clear line of sight in December may not be clear in July when the trees are full of leaves. Rain also absorbs 
these waves. To some extent, errors introduced by rain can be compensated for with error correcting codes or 
turning up the power when it is raining. Nevertheless, LMDS service is more likely to be rolled out first in dry 
climates, say, in Arizona rather than in Seattle. 


Wireless local loops are not likely to catch on unless there are standards, to encourage equipment vendors to 
produce products and to ensure that customers can change CLECs without having to buy new equipment. To 
provide this standardization, IEEE set up a committee called 802.16 to draw up a standard for LMDS. The 
802.16 standard was published in April 2002. IEEE calls 802.16 a wireless MAN. 


IEEE 802.16 was designed for digital telephony, Internet access, connection of two remote LANs, television and 
radio broadcasting, and other uses. We will look at it in more detail in Chap. 4. 


2.3.4 Trunks and Multiplexing 


Economies of scale play an important role in the telephone system. It costs essentially the same amount of 
money to install and maintain a high-bandwidth trunk as a low-bandwidth trunk between two switching offices 
(i.e., the costs come from having to dig the trench and not from the copper wire or optical fiber). Consequently, 
telephone companies have developed elaborate schemes for multiplexing many conversations over a single 
physical trunk. These multiplexing schemes can be divided into two basic categories: FDM (Frequency Division 
Multiplexing) and TDM (Time Division Multiplexing). In FDM, the frequency spectrum is divided into frequency 
bands, with each user having exclusive possession of some band. In TDM, the users take turns (in a round-robin 
fashion), each one periodically getting the entire bandwidth for a little burst of time. 


AM radio broadcasting provides illustrations of both kinds of multiplexing. The allocated spectrum is about 1 
MHz, roughly 500 to 1500 kHz. Different frequencies are allocated to different logical channels (stations), each 
operating in a portion of the spectrum, with the interchannel separation great enough to prevent interference. 
This system is an example of frequency division multiplexing. In addition (in some countries), the individual 
stations have two logical subchannels: music and advertising. These two alternate in time on the same 
frequency, first a burst of music, then a burst of advertising, then more music, and so on. This situation is time 
division multiplexing. 


Below we will examine frequency division multiplexing. After that we will see how FDM can be applied to fiber 
optics (wavelength division multiplexing). Then we will turn to TDM, and end with an advanced TDM system 
used for fiber optics (SONET). 


Frequency Division Multiplexing 


Figure 2-31 shows how three voice-grade telephone channels are multiplexed using FDM. Filters limit the usable 
bandwidth to about 3100 Hz per voice-grade channel. When many channels are multiplexed together, 4000 Hz 
is allocated to each channel to keep them well separated. First the voice channels are raised in frequency, each 
by a different amount. Then they can be combined because no two channels now occupy the same portion of 
the spectrum. Notice that even though there are gaps (guard bands) between the channels, there is some 
overlap between adjacent channels because the filters do not have sharp edges. This overlap means that a 
strong spike at the edge of one channel will be felt in the adjacent one as nonthermal noise. 


Figure 2-31. Frequency division multiplexing. (a) The original bandwidths. (b) The bandwidths raised in 
frequency. (c) The multiplexed channel. 


The FDM schemes used around the world are to some degree standardized. A widespread standard is twelve 
4000-Hz voice channels multiplexed into the 60 to 108 kHz band. This unit is called a group. The 12-kHz to 
60-kHz band is sometimes used for another group. Many carriers offer a 48- to 56-kbps leased line service to 
customers, based on the group. Five groups (60 voice channels) can be multiplexed to form a supergroup. The 
next unit is the mastergroup, which is five supergroups (CCITT standard) or ten supergroups (Bell system). 
Other standards of up to 230,000 voice channels also exist. 
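

The group layout and the channel counts of the hierarchy follow directly from the numbers above. As a rough sketch (the function and constant names are hypothetical, and the slots are assumed to be packed contiguously from 60 kHz upward):

```python
CHANNEL_HZ = 4000          # bandwidth allocated to each voice channel
GROUP_START_HZ = 60_000    # lower edge of the standard group band
CHANNELS_PER_GROUP = 12

def group_slots(start_hz=GROUP_START_HZ, n=CHANNELS_PER_GROUP, width_hz=CHANNEL_HZ):
    """Return the (low, high) band edges in Hz of each channel slot in a group."""
    return [(start_hz + i * width_hz, start_hz + (i + 1) * width_hz) for i in range(n)]

slots = group_slots()
print(slots[0], slots[-1])   # first and last slot of the 60-108 kHz band

# Channel counts higher up the hierarchy.
channels_per_supergroup = CHANNELS_PER_GROUP * 5          # five groups
mastergroup_ccitt = channels_per_supergroup * 5           # five supergroups
mastergroup_bell = channels_per_supergroup * 10           # ten supergroups
print(channels_per_supergroup, mastergroup_ccitt, mastergroup_bell)
```

Twelve 4000-Hz slots starting at 60 kHz end exactly at 108 kHz, which is why the group occupies the 60 to 108 kHz band.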


Wavelength Division Multiplexing 


For fiber optic channels, a variation of frequency division multiplexing is used. It is called WDM (Wavelength 
Division Multiplexing). The basic principle of WDM on fibers is depicted in Fig. 2-32. Here four fibers come 
together at an optical combiner, each with its energy present at a different wavelength. The four beams are 
combined onto a single shared fiber for transmission to a distant destination. At the far end, the beam is split up 
over as many fibers as there were on the input side. Each output fiber contains a short, specially constructed 
core that filters out all but one wavelength. The resulting signals can be routed to their destination or recombined 
in different ways for additional multiplexed transport. 


Figure 2-32. Wavelength division multiplexing. 


There is really nothing new here. This is just frequency division multiplexing at very high frequencies. As long as 
each channel has its own frequency (i.e., wavelength) range and all the ranges are disjoint, they can be 
multiplexed together on the long-haul fiber. The only difference with electrical FDM is that an optical system 
using a diffraction grating is completely passive and thus highly reliable. 


WDM technology has been progressing at a rate that puts computer technology to shame. WDM was invented 
around 1990. The first commercial systems had eight channels of 2.5 Gbps per channel. By 1998, systems with 
40 channels of 2.5 Gbps were on the market. By 2001, there were products with 96 channels of 10 Gbps, for a 
total of 960 Gbps. This is enough bandwidth to transmit 30 full-length movies per second (in MPEG-2). Systems 
with 200 channels are already working in the laboratory. When the number of channels is very large and the 
wavelengths are spaced close together, for example, 0.1 nm, the system is often referred to as DWDM (Dense 
WDM). 


It should be noted that the reason WDM is popular is that the energy on a single fiber is typically only a few 
gigahertz wide because it is currently impossible to convert between electrical and optical media any faster. By 
running many channels in parallel on different wavelengths, the aggregate bandwidth is increased linearly with 
the number of channels. Since the bandwidth of a single fiber band is about 25,000 GHz (see Fig. 2-6), there is 
theoretically room for 2500 10-Gbps channels even at 1 bit/Hz (and higher rates are also possible). 
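

The aggregate figures quoted for WDM are simple products, and the theoretical channel count is a simple quotient. A short sketch of the arithmetic, with illustrative names:

```python
def aggregate_gbps(channels, gbps_per_channel):
    """Total capacity of a WDM system: capacity grows linearly with channel count."""
    return channels * gbps_per_channel

# (channels, Gbps per channel) for the systems mentioned in the text.
systems = {"first commercial": (8, 2.5), "1998": (40, 2.5), "2001": (96, 10)}
totals = {name: aggregate_gbps(*spec) for name, spec in systems.items()}
print(totals)

# Theoretical limit: a 25,000-GHz fiber band at 1 bit/Hz, cut into 10-Gbps channels.
FIBER_BAND_GHZ = 25_000
BITS_PER_HZ = 1
CHANNEL_GBPS = 10
max_channels = FIBER_BAND_GHZ * BITS_PER_HZ // CHANNEL_GBPS
print(max_channels)
```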


Another new development is all-optical amplifiers. Previously, every 100 km it was necessary to split up all the 
channels and convert each one to an electrical signal for amplification separately before reconverting to optical 
and combining them. Nowadays, all-optical amplifiers can regenerate the entire signal once every 1000 km 
without the need for multiple opto-electrical conversions. 


In the example of Fig. 2-32, we have a fixed wavelength system. Bits from input fiber 1 go to output fiber 3, bits 
from input fiber 2 go to output fiber 1, etc. However, it is also possible to build WDM systems that are switched. 
In such a device, the output filters are tunable using Fabry-Perot or Mach-Zehnder interferometers. For more 
information about WDM and its application to Internet packet switching, see (Elmirghani and Mouftah, 2000; 
Hunter and Andonovic, 2000; and Listani et al., 2001). 


Time Division Multiplexing 


WDM technology is wonderful, but there is still a lot of copper wire in the telephone system, so let us turn back to 
it for a while. Although FDM is still used over copper wires or microwave channels, it requires analog circuitry 
and is not amenable to being done by a computer. In contrast, TDM can be handled entirely by digital 
electronics, so it has become far more widespread in recent years. Unfortunately, it can only be used for digital 
data. Since the local loops produce analog signals, a conversion is needed from analog to digital in the end 
office, where all the individual local loops come together to be combined onto outgoing trunks. 


We will now look at how multiple analog voice signals are digitized and combined onto a single outgoing digital 
trunk. Computer data sent over a modem are also analog, so the following description also applies to them. The 
analog signals are digitized in the end office by a device called a codec (coder-decoder), producing a series of 
8-bit numbers. The codec makes 8000 samples per second (125 µsec/sample) because the Nyquist theorem says 
that this is sufficient to capture all the information from the 4-kHz telephone channel bandwidth. At a lower 
sampling rate, information would be lost; at a higher one, no extra information would be gained. This technique 
is called PCM (Pulse Code Modulation). PCM forms the heart of the modern telephone system. As a 
consequence, virtually all time intervals within the telephone system are multiples of 125 µsec. 
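

To make the sampling numbers concrete, here is a minimal sketch of PCM encoding. It assumes a linear (uniform) quantizer for simplicity; the function name `quantize` and the 1-kHz test tone are illustrative choices, not part of any standard.

```python
import math

SAMPLE_RATE_HZ = 8000      # twice the 4-kHz channel bandwidth (Nyquist)
BITS_PER_SAMPLE = 8

sample_interval_us = 1_000_000 / SAMPLE_RATE_HZ   # time between samples
channel_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE    # bit rate of one digitized channel

def quantize(x, bits=BITS_PER_SAMPLE):
    """Map an amplitude in [-1.0, 1.0] to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits
    x = max(-1.0, min(1.0, x))                    # clip out-of-range input
    return min(levels - 1, int((x + 1.0) / 2.0 * levels))

# Eight consecutive samples (one millisecond) of a 1-kHz tone.
codes = [quantize(math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE_HZ)) for n in range(8)]
print(sample_interval_us, channel_bps, codes)
```

With 8 bits per sample the channel runs at 64,000 bps; with the 7-bit samples used for data in T1, the same calculation gives 56,000 bps.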


When digital transmission began emerging as a feasible technology, CCITT was unable to reach agreement on 
an international standard for PCM. Consequently, a variety of incompatible schemes are now in use in different 
countries around the world. 


The method used in North America and Japan is the T1 carrier, depicted in Fig. 2-33. (Technically speaking, the 
format is called DS1 and the carrier is called T1, but following widespread industry tradition, we will not make 
that subtle distinction here.) The T1 carrier consists of 24 voice channels multiplexed together. Usually, the 
analog signals are sampled on a round-robin basis with the resulting analog stream being fed to the codec rather 
than having 24 separate codecs and then merging the digital output. Each of the 24 channels, in turn, gets to 
insert 8 bits into the output stream. Seven bits are data and one is for control, yielding 7 x 8000 = 56,000 bps of 
data, and 1 x 8000 = 8000 bps of signaling information per channel. 


Figure 2-33. The T1 carrier (1.544 Mbps). 


A frame consists of 24 x 8 = 192 bits plus one extra bit for framing, yielding 193 bits every 125 µsec. This gives a 
gross data rate of 1.544 Mbps. The 193rd bit is used for frame synchronization. It takes on the pattern 
0101010101.... Normally, the receiver keeps checking this bit to make sure that it has not lost synchronization. 
If it does get out of sync, the receiver can scan for this pattern to get resynchronized. Analog customers cannot 
generate the bit pattern at all because it corresponds to a sine wave at 4000 Hz, which would be filtered out. 
Digital customers can, of course, generate this pattern, but the odds are against its being present when the 
frame slips. When a T1 system is being used entirely for data, only 23 of the channels are used for data. The 
24th one is used for a special synchronization pattern, to allow faster recovery in the event that the frame slips. 
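

The frame layout and the resulting rates can be checked with a small sketch. The code below assembles frames from 24 samples plus a framing bit; placing the framing bit last is a choice made for the sketch (its position does not affect the bit count), and the function name is hypothetical.

```python
CHANNELS = 24
BITS_PER_SAMPLE = 8
FRAMES_PER_SEC = 8000      # one frame every 125 microseconds

def build_t1_frame(samples, framing_bit):
    """Return the 193 bits of one frame: 24 samples of 8 bits, then the framing bit."""
    if len(samples) != CHANNELS:
        raise ValueError("a T1 frame carries exactly 24 samples")
    bits = []
    for sample in samples:
        # Emit each sample most significant bit first.
        bits.extend((sample >> shift) & 1 for shift in range(BITS_PER_SAMPLE - 1, -1, -1))
    bits.append(framing_bit)
    return bits

# Four consecutive frames; the framing bit alternates 0, 1, 0, 1, ...
frames = [build_t1_frame(list(range(CHANNELS)), i % 2) for i in range(4)]

frame_bits = len(frames[0])
gross_bps = frame_bits * FRAMES_PER_SEC
data_bps_per_channel = 7 * FRAMES_PER_SEC        # seven data bits per sample
signaling_bps_per_channel = 1 * FRAMES_PER_SEC   # one control bit per sample
print(frame_bits, gross_bps, data_bps_per_channel, signaling_bps_per_channel)
```

A receiver that has lost synchronization would look for the bit position at which the alternating pattern reappears in successive frames.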


When CCITT finally did reach agreement, they felt that 8000 bps of signaling information was far too much, so its 
1.544-Mbps standard is based on an 8- rather than a 7-bit data item; that is, the analog signal is quantized into 
256 rather than 128 discrete levels. Two (incompatible) variations are provided. In common-channel signaling, 
the extra bit (which is attached onto the rear rather than the front of the 193-bit frame) takes on the values 
10101010 in the odd frames and contains signaling information for all the channels in the even frames. 


In the other variation, channel-associated signaling, each channel has its own private signaling subchannel. A 
private subchannel is arranged by allocating one of the eight user bits in every sixth frame for signaling 
purposes, so five out of six samples are 8 bits wide, and the other one is only 7 bits wide. CCITT also 
recommended a PCM carrier at 2.048 Mbps called E1. This carrier has 32 8-bit data samples packed into the 
basic 125-µsec frame. Thirty of the channels are used for information and two are used for signaling. Each group 
of four frames provides 64 signaling bits, half of which are used for channel-associated signaling and half of 
which are used for frame synchronization or are reserved for each country to use as it wishes. Outside North 
America and Japan, the 2.048-Mbps E1 carrier is used instead of T1. 


Once the voice signal has been digitized, it is tempting to try to use statistical techniques to reduce the number 
of bits needed per channel. These techniques are appropriate not only for encoding speech, but for the 
digitization of any analog signal. All of the compaction methods are based on the principle that the signal 
changes relatively slowly compared to the sampling frequency, so that much of the information in the 7- or 8-bit 
digital level is redundant. 


One method, called differential pulse code modulation, consists of outputting not the digitized amplitude, but the 
difference between the current value and the previous one. Since jumps of ±16 or more on a scale of 128 are 
unlikely, 5 bits should suffice instead of 7. If the signal does occasionally jump wildly, the encoding logic may 
require several sampling periods to "catch up." For speech, the error introduced can be ignored. 
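

The catch-up behavior is easy to see in a small sketch. The code below assumes a 5-bit signed difference (range -16 to +15) and an encoder that tracks the value the decoder will reconstruct; the function names and the test signal are illustrative.

```python
def dpcm_encode(samples, bits=5, start=0):
    """Encode each sample as a clamped difference from the reconstructed value."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    diffs, reconstructed = [], start
    for s in samples:
        d = max(lo, min(hi, s - reconstructed))   # clamp to what 5 bits can hold
        diffs.append(d)
        reconstructed += d                        # mirror the decoder's state
    return diffs

def dpcm_decode(diffs, start=0):
    """Rebuild the signal by accumulating the differences."""
    out, value = [], start
    for d in diffs:
        value += d
        out.append(value)
    return out

signal = [0, 3, 7, 12, 60, 62, 63, 63]   # slow changes, then one wild jump
diffs = dpcm_encode(signal)
rebuilt = dpcm_decode(diffs)
print(diffs)     # [0, 3, 4, 5, 15, 15, 15, 6]
print(rebuilt)   # [0, 3, 7, 12, 27, 42, 57, 63]
```

After the jump from 12 to 60, the output needs four sampling periods to reach the true value again.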


A variation of this compaction method requires each sampled value to differ from its predecessor by either +1 or 
-1. Under these conditions, a single bit can be transmitted, telling whether the new sample is above or below the 
previous one. This technique, called delta modulation, is illustrated in Fig. 2-34. Like all compaction techniques 
that assume small level changes between consecutive samples, delta encoding can get into trouble if the signal 
changes too fast, as shown in the figure. When this happens, information is lost. 


Figure 2-34. Delta modulation. 


An improvement to differential PCM is to extrapolate the previous few values to predict the next value and then 
to encode the difference between the actual signal and the predicted one. The transmitter and receiver must use 
the same prediction algorithm, of course. Such schemes are called predictive encoding. They are useful 
because they reduce the size of the numbers to be encoded, hence the number of bits to be sent. 
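

As an illustration of predictive encoding, the sketch below uses one possible predictor: linear extrapolation from the last two values. This particular predictor is an assumption chosen for simplicity, not a method named in the text, and the residuals are left unclamped so the scheme is lossless.

```python
def predict(history):
    """Predict the next value by linear extrapolation from the last two values."""
    if len(history) < 2:
        return history[-1] if history else 0
    return 2 * history[-1] - history[-2]

def encode(samples):
    """Send only the difference between each sample and its predicted value."""
    history, residuals = [], []
    for s in samples:
        residuals.append(s - predict(history))
        history.append(s)
    return residuals

def decode(residuals):
    """Run the same predictor and add the received residuals back."""
    history = []
    for r in residuals:
        history.append(predict(history) + r)
    return history

ramp = [10, 20, 30, 40, 50, 58, 64]
residuals = encode(ramp)
print(residuals)              # [10, 10, 0, 0, 0, -2, -2]
print(decode(residuals) == ramp)
```

For this steadily rising signal, plain differencing would send 10, 10, 10, 10, 10, 8, 6, whereas the residuals after prediction are mostly zero and so need fewer bits.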


Time division multiplexing allows multiple T1 carriers to be multiplexed into higher-order carriers. Figure 2-35 
shows how this can be done. At the left we see four T1 channels being multiplexed onto one T2 channel. The 
multiplexing at T2 and above is done bit for bit, rather than byte for byte with the 24 voice channels that make up 
a T1 frame. Four T1 streams at 1.544 Mbps should generate 6.176 Mbps, but T2 is actually 6.312 Mbps. The 
extra bits are used for framing and recovery in case the carrier slips. T1 and T3 are widely used by customers, 
whereas T2 and T4 are only used within the telephone system itself, so they are not well known. 


Figure 2-35. Multiplexing T1 streams onto higher carriers. 


At the next level, seven T2 streams are combined bitwise to form a T3 stream. Then six T3 streams are joined to 
form a T4 stream. At each step a small amount of overhead is added for framing and recovery in case the 
synchronization between sender and receiver is lost. 
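

The overhead added at each level, and the bit-for-bit interleaving, can be sketched as follows. The T1 and T2 rates come from the text above and the T3 rate is quoted later in this section; the T4 rate of 274.176 Mbps is not quoted in the text and is an assumption of this sketch, as are the function names.

```python
# Rates in Mbps. T4 is an assumed value not quoted in the text.
RATES = {"T1": 1.544, "T2": 6.312, "T3": 44.736, "T4": 274.176}
STEPS = [("T1", "T2", 4), ("T2", "T3", 7), ("T3", "T4", 6)]

def overhead_mbps(lower, upper, fan_in):
    """Bandwidth of the upper carrier not accounted for by its tributaries."""
    return round(RATES[upper] - fan_in * RATES[lower], 3)

for lower, upper, fan_in in STEPS:
    print(upper, "overhead:", overhead_mbps(lower, upper, fan_in), "Mbps")

def bit_interleave(streams):
    """Merge equal-length bit streams one bit at a time, in round-robin order."""
    return [bit for group in zip(*streams) for bit in group]

# Four tributaries of two bits each, merged as at the T1-to-T2 step.
merged = bit_interleave([[1, 1], [0, 0], [1, 0], [0, 1]])
print(merged)   # [1, 0, 1, 0, 1, 0, 0, 1]
```

A real multiplexer also inserts the framing and recovery bits into the merged stream; the sketch only shows the interleaving order.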


Just as there is little agreement on the basic carrier between the United States and the rest of the world, there is 
equally little agreement on how it is to be multiplexed into higher-bandwidth carriers. The U.S. scheme of 
stepping up by 4, 7, and 6 did not strike everyone else as the way to go, so the CCITT standard calls for 
multiplexing four streams onto one stream at each level. Also, the framing and recovery data are different 
between the U.S. and CCITT standards. The CCITT hierarchy for 32, 128, 512, 2048, and 8192 channels runs at 
speeds of 2.048, 8.448, 34.304, 139.264, and 565.148 Mbps. 


SONET/SDH 


In the early days of fiber optics, every telephone company had its own proprietary optical TDM system. After 
AT&T was broken up in 1984, local telephone companies had to connect to multiple long-distance carriers, all 
with different optical TDM systems, so the need for standardization became obvious. In 1985, Bellcore, the 
RBOCs' research arm, began working on a standard, called SONET (Synchronous Optical NETwork). Later, 
CCITT joined the effort, which resulted in a SONET standard and a set of parallel CCITT recommendations 
(G.707, G.708, and G.709) in 1989. The CCITT recommendations are called SDH (Synchronous Digital 
Hierarchy) but differ from SONET only in minor ways. Virtually all the long-distance telephone traffic in the United 
States, and much of it elsewhere, now uses trunks running SONET in the physical layer. For additional 
information about SONET, see (Bellamy, 2000; Goralski, 2000; and Shepard, 2001). 


The SONET design had four major goals. First and foremost, SONET had to make it possible for different 
carriers to interwork. Achieving this goal required defining a common signaling standard with respect to 
wavelength, timing, framing structure, and other issues. 


Second, some means was needed to unify the U.S., European, and Japanese digital systems, all of which were 
based on 64-kbps PCM channels, but all of which combined them in different (and incompatible) ways. 


Third, SONET had to provide a way to multiplex multiple digital channels. At the time SONET was devised, the 
highest-speed digital carrier actually used widely in the United States was T3, at 44.736 Mbps. T4 was defined, 
but not used much, and nothing was even defined above T4 speed. Part of SONET's mission was to continue 
the hierarchy to gigabits/sec and beyond. A standard way to multiplex slower channels into one SONET channel 
was also needed. 


Fourth, SONET had to provide support for operations, administration, and maintenance (OAM). Previous 
systems did not do this very well. 


An early decision was to make SONET a traditional TDM system, with the entire bandwidth of the fiber devoted 
to one channel containing time slots for the various subchannels. As such, SONET is a synchronous system. It is 
controlled by a master clock with an accuracy of about 1 part in 10^9. Bits on a SONET line are sent out at 
extremely precise intervals, controlled by the master clock. When cell switching was later proposed to be the 
basis of ATM, the fact that it permitted irregular cell arrivals got it labeled as Asynchronous Transfer Mode to 
contrast it to the synchronous operation of SONET. With SONET, the sender and receiver are tied to a common 
clock; with ATM they are not. 


The basic SONET frame is a block of 810 bytes put out every 125 µsec. Since SONET is synchronous, frames 
are emitted whether or not there are any useful data to send. Having 8000 frames/sec exactly matches the 
sampling rate of the PCM channels used in all digital telephony systems. 


The 810-byte SONET frames are best described as a rectangle of bytes, 90 columns wide by 9 rows high. Thus, 
8 x 810 = 6480 bits are transmitted 8000 times per second, for a gross data rate of 51.84 Mbps. This is the basic 
SONET channel, called STS-1 (Synchronous Transport Signal-1). All SONET trunks are a multiple of STS-1. 


The first three columns of each frame are reserved for system management information, as illustrated in 
Fig. 2-36. The first three rows contain the section overhead; the next six contain the line overhead. The section 
overhead is generated and checked at the start and end of each section, whereas the line overhead is 
generated and checked at the start and end of each line. 


Figure 2-36. Two back-to-back SONET frames. 


A SONET transmitter sends back-to-back 810-byte frames, without gaps between them, even when there are no 
data (in which case it sends dummy data). From the receiver's point of view, all it sees is a continuous bit 
stream, so how does it know where each frame begins? The answer is that the first two bytes of each frame 
contain a fixed pattern that the receiver searches for. If it finds this pattern in the same place in a large number 
of consecutive frames, it assumes that it is in sync with the sender. In theory, a user could insert this pattern into 
the payload in a regular way, but in practice it cannot be done due to the multiplexing of multiple users into the 
same frame and other reasons. 
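

The synchronization search described above can be sketched in a few lines. The text does not give the value of the two-byte pattern, so the bytes 0xF6 0x28 used here are an assumption of the sketch, as are the function name, the number of frames required, and the simplification that the search is byte-aligned (a real receiver hunts at bit granularity).

```python
FRAME_BYTES = 810
SYNC = bytes([0xF6, 0x28])   # assumed two-byte framing pattern

def find_frame_start(stream, frames_required=4):
    """Return the first offset at which SYNC recurs every FRAME_BYTES bytes, or None."""
    for offset in range(FRAME_BYTES):
        positions = [offset + k * FRAME_BYTES for k in range(frames_required)]
        if positions[-1] + len(SYNC) > len(stream):
            break                                 # not enough data left to decide
        if all(stream[p:p + len(SYNC)] == SYNC for p in positions):
            return offset
    return None

# Dummy frames: the sync pattern followed by zero bytes.
frame = SYNC + bytes(FRAME_BYTES - len(SYNC))
# The receiver starts listening 100 bytes before a frame boundary.
stream = bytes(range(100)) + frame * 5

print(find_frame_start(stream))   # 100
```

Requiring the pattern in several consecutive frames guards against a chance occurrence of the same two bytes inside the payload.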


The remaining 87 columns hold 87 x 9 x 8 x 8000 = 50.112 Mbps of user data. However, the user data, called 
the SPE (Synchronous Payload Envelope), do not always begin in row 1, column 4. The SPE can begin 
anywhere within the frame. A pointer to the first byte is contained in the first row of the line overhead. The first 
column of the SPE is the path overhead (i.e., header for the end-to-end path sublayer protocol). 


The ability to allow the SPE to begin anywhere within the SONET frame and even to span two frames, as shown 
in Fig. 2-36, gives added flexibility to the system. For example, if a payload arrives at the source while a dummy 
SONET frame is being constructed, it can be inserted into the current frame instead of being held until the start 
of the next one. 


The SONET multiplexing hierarchy is shown in Fig. 2-37. Rates from STS-1 to STS-192 have been defined. The 
optical carrier corresponding to STS-n is called OC-n but is bit for bit the same except for a certain bit reordering 
needed for synchronization. The SDH names are different, and they start at OC-3 because CCITT-based 
systems do not have a rate near 51.84 Mbps. The OC-9 carrier is present because it closely matches the speed 
of a major high-speed trunk used in Japan. OC-18 and OC-36 are used in Japan. The gross data rate includes 
all the overhead. The SPE data rate excludes the line and section overhead. The user data rate excludes all 
overhead and counts only the 86 payload columns. 
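

The three data rates follow from the frame geometry alone. The sketch below recomputes them for any STS-n, assuming (as the figures in the text imply) that every rate scales linearly with n; the function name is hypothetical.

```python
ROWS = 9
COLUMNS = 90
FRAMES_PER_SEC = 8000
OVERHEAD_COLUMNS = 3   # section and line overhead
PATH_COLUMNS = 1       # path overhead, the first column of the SPE

def sts_rates_mbps(n):
    """Return (gross, SPE, user) data rates in Mbps for STS-n."""
    def rate(columns):
        return columns * ROWS * 8 * FRAMES_PER_SEC * n / 1e6
    gross = rate(COLUMNS)
    spe = rate(COLUMNS - OVERHEAD_COLUMNS)
    user = rate(COLUMNS - OVERHEAD_COLUMNS - PATH_COLUMNS)
    return gross, spe, user

for n in (1, 3, 12, 48, 192):
    gross, spe, user = sts_rates_mbps(n)
    print(f"STS-{n}: gross {gross:.2f}, SPE {spe:.3f}, user {user:.3f}")
```

For STS-1 this gives 51.84, 50.112, and 49.536 Mbps, corresponding to 90, 87, and 86 columns respectively.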


Figure 2-37. SONET and SDH multiplex rates. 


SONET                    SDH        Data rate (Mbps)
Electrical   Optical     Optical    Gross       SPE         User
STS-1        OC-1                   51.84       50.112      49.536
STS-3        OC-3        STM-1      155.52      150.336     148.608
STS-9        OC-9        STM-3      466.56      451.008     445.824
STS-12       OC-12       STM-4      622.08      601.344     594.432
STS-18       OC-18       STM-6      933.12      902.016     891.648
STS-24       OC-24       STM-8      1244.16     1202.688    1188.864
STS-36       OC-36       STM-12     1866.24     1804.032    1783.296
STS-48       OC-48       STM-16     2488.32     2405.376    2377.728
STS-192      OC-192      STM-64     9953.28     9621.504    9510.912


As an aside, when a carrier, such as OC-3, is not multiplexed, but carries the data from only a single source, the 
letter c (for concatenated) is appended to the designation, so OC-3 indicates a 155.52-Mbps carrier consisting of 
three separate OC-1 carriers, but OC-3c indicates a data stream from a single source at 155.52 Mbps. The three 
OC-1 streams within an OC-3c stream are interleaved by column, first column 1 from stream 1, then column 1 
from stream 2, then column 1 from stream 3, followed by column 2 from stream 1, and so on, leading to a frame 
270 columns wide and 9 rows deep. 
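

The column interleaving just described can be sketched as follows. Each byte is represented by a (stream, column) label so the resulting order is visible; the function name and the labeling scheme are illustrative only.

```python
def interleave_columns(frames):
    """Interleave equal-sized frames column by column:
    column 1 of every stream, then column 2 of every stream, and so on."""
    rows = len(frames[0])
    cols = len(frames[0][0])
    return [
        [frames[s][r][c] for c in range(cols) for s in range(len(frames))]
        for r in range(rows)
    ]

# Three frames of 9 rows by 90 columns; each byte is labeled (stream, column).
streams = [[[(s, c) for c in range(1, 91)] for _ in range(9)] for s in (1, 2, 3)]
merged = interleave_columns(streams)

print(len(merged), len(merged[0]))   # 9 270
print(merged[0][:4])                 # [(1, 1), (2, 1), (3, 1), (1, 2)]
```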


2.3.5 Switching 


From the point of view of the average telephone engineer, the phone system is divided into two principal parts: 
outside plant (the local loops and trunks, since they are physically outside the switching offices) and inside plant 
(the switches), which are inside the switching offices. We have just looked at the outside plant. Now it is time to 
examine the inside plant. 


Two different switching techniques are used nowadays: circuit switching and packet switching. We will give a 
brief introduction to each of them below. Then we will go into circuit switching in detail because that is how the 
telephone system works. We will study packet switching in detail in subsequent chapters. 


Circuit Switching 


When you or your computer places a telephone call, the switching equipment within the telephone system seeks 
out a physical path all the way from your telephone to the receiver's telephone. This technique is called circuit 
switching and is shown schematically in Fig. 2-38(a). Each of the six rectangles represents a carrier switching 
office (end office, toll office, etc.). In this example, each office has three incoming lines and three outgoing lines. 
When a call passes through a switching office, a physical connection is (conceptually) established between the 
line on which the call came in and one of the output lines, as shown by the dotted lines. 


Figure 2-38. (a) Circuit switching. (b) Packet switching. 


In the early days of the telephone, the connection was made by the operator plugging a jumper cable into the 
input and output sockets. In fact, a surprising little story is associated with the invention of automatic circuit 
switching equipment. It was invented by a 19th century Missouri undertaker named Almon B. Strowger. Shortly 
after the telephone was invented, when someone died, one of the survivors would call the town operator and say 
"Please connect me to an undertaker." Unfortunately for Mr. Strowger, there were two undertakers in his town, 
and the other one's wife was the town telephone operator. He quickly saw that either he was going to have to 
invent automatic telephone switching equipment or he was going to go out of business. He chose the first option. 
For nearly 100 years, the circuit-switching equipment used worldwide was known as Strowger gear. (History 
does not record whether the now-unemployed switchboard operator got a job as an information operator, 
answering questions such as "What is the phone number of an undertaker?") 


The model shown in Fig. 2-39(a) is highly simplified, of course, because parts of the physical path between the 
two telephones may, in fact, be microwave or fiber links onto which thousands of calls are multiplexed. 
Nevertheless, the basic idea is valid: once a call has been set up, a dedicated path between both ends exists 
and will continue to exist until the call is finished. 


Figure 2-39. Timing of events in (a) circuit switching, (b) message switching, (c) packet switching. 


The alternative to circuit switching is packet switching, shown in Fig. 2-38(b). With this technology, individual 
packets are sent as need be, with no dedicated path being set up in advance. It is up to each packet to find its 
way to the destination on its own. 


An important property of circuit switching is the need to set up an end-to-end path before any data can be sent. 
The elapsed time between the end of dialing and the start of ringing can easily be 10 sec, more on long-distance 
or international calls. During this time interval, the telephone system is hunting for a path, as shown in 
Fig. 2-39(a). Note that before data transmission can even begin, the call request signal must propagate all the way to 
the destination and be acknowledged. For many computer applications (e.g., point-of-sale credit verification), 
long setup times are undesirable. 


As a consequence of the reserved path between the calling parties, once the setup has been completed, the 
only delay for data is the propagation time for the electromagnetic signal, about 5 msec per 1000 km. Also as a 
consequence of the established path, there is no danger of congestion—that is, once the call has been put 
through, you never get busy signals. Of course, you might get one before the connection has been established 
due to lack of switching or trunk capacity. 
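
The 5 msec per 1000 km figure lends itself to a quick calculation. The following is a minimal sketch in Python; the function name and the sample distance are illustrative assumptions, not part of any telephone standard.

```python
# Assumption: propagation takes about 5 msec per 1000 km, the figure quoted above.
MSEC_PER_1000_KM = 5.0


def propagation_delay_msec(distance_km: float) -> float:
    """One-way propagation delay over an already established circuit."""
    return distance_km / 1000.0 * MSEC_PER_1000_KM


if __name__ == "__main__":
    # A hypothetical 3000-km call: after setup, the only delay is propagation.
    print(propagation_delay_msec(3000))
```

Under this assumption a 3000-km circuit adds about 15 msec in each direction, which is small compared with the setup time discussed above.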


Message Switching 


An alternative switching strategy is message switching, illustrated in Fig. 2-39(b). When this form of switching is 
used, no physical path is established in advance between sender and receiver. Instead, when the sender has a 
block of data to be sent, it is stored in the first switching office (i.e., router) and then forwarded later, one hop at a 
time. Each block is received in its entirety, inspected for errors, and then retransmitted. A network using this 
technique is called a store-and-forward network, as mentioned in Chap. 1. 


The first electromechanical telecommunication systems used message switching, namely, for telegrams. The 
message was punched on paper tape (off-line) at the sending office, and then read in and transmitted over a 
communication line to the next office along the way, where it was punched out on paper tape. An operator there 
tore the tape off and read it in on one of the many tape readers, one reader per outgoing trunk. Such a switching 
office was called a torn tape office. Paper tape is long gone and message switching is not used any more, so we 
will not discuss it further in this book. 


Packet Switching 


With message switching, there is no limit at all on block size, which means that routers (in a modern system) 
must have disks to buffer long blocks. It also means that a single block can tie up a router-router line for minutes, 
rendering message switching useless for interactive traffic. To get around these problems, packet switching was 
invented, as described in Chap. 1. Packet-switching networks place a tight upper limit on block size, allowing 
packets to be buffered in router main memory instead of on disk. By making sure that no user can monopolize 
any transmission line very long (milliseconds), packet-switching networks are well suited for handling interactive 
traffic. A further advantage of packet switching over message switching is shown in Fig. 2-39(b) and (c): the first 
packet of a multipacket message can be forwarded before the second one has fully arrived, reducing delay and 
improving throughput. For these reasons, computer networks are usually packet switched, occasionally circuit 
switched, but never message switched. 
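
The delay advantage of Fig. 2-39(c) over Fig. 2-39(b) can be estimated with a small model. The sketch below assumes identical lines on every hop, ignores propagation delay, headers, and queueing, and treats the last packet as full size; the function names and sample numbers are ours, chosen only for illustration.

```python
import math


def message_switching_delay(message_bits: int, link_bps: float, hops: int) -> float:
    """Each hop must receive the whole message before forwarding it."""
    return hops * message_bits / link_bps


def packet_switching_delay(message_bits: int, packet_bits: int,
                           link_bps: float, hops: int) -> float:
    """Packets are pipelined: while one is on hop 2, the next is on hop 1."""
    n_packets = math.ceil(message_bits / packet_bits)
    # The first packet needs `hops` transmission times; each later packet
    # arrives one transmission time after its predecessor.
    return (n_packets + hops - 1) * packet_bits / link_bps


if __name__ == "__main__":
    # Hypothetical case: 1-Mbit message, 1-Mbps lines, 3 hops, 10,000-bit packets.
    print(message_switching_delay(1_000_000, 1_000_000, 3))
    print(packet_switching_delay(1_000_000, 10_000, 1_000_000, 3))
```

Under these assumptions the message-switched transfer takes 3 sec and the packet-switched one about 1.02 sec, because the three hops work in parallel on different packets.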


Circuit switching and packet switching differ in many respects. To start with, circuit switching requires that a 
circuit be set up end to end before communication begins. Packet switching does not require any advance setup. 
The first packet can just be sent as soon as it is available. 


The result of the connection setup with circuit switching is the reservation of bandwidth all the way from the 
sender to the receiver. All packets follow this path. Among other properties, having all packets follow the same 
path means that they cannot arrive out of order. With packet switching there is no path, so different packets can 
follow different paths, depending on network conditions at the time they are sent. They may arrive out of order. 


Packet switching is more fault tolerant than circuit switching. In fact, that is why it was invented. If a switch goes 
down, all of the circuits using it are terminated and no more traffic can be sent on any of them. With packet 
switching, packets can be routed around dead switches. 


Setting up a path in advance also opens up the possibility of reserving bandwidth in advance. If bandwidth is 
reserved, then when a packet arrives, it can be sent out immediately over the reserved bandwidth. With packet 
switching, no bandwidth is reserved, so packets may have to wait their turn to be forwarded. 


Having bandwidth reserved in advance means that no congestion can occur when a packet shows up (unless 
more packets show up than expected). On the other hand, when an attempt is made to establish a circuit, the 
attempt can fail due to congestion. Thus, congestion can occur at different times with circuit switching (at setup 
time) and packet switching (when packets are sent). 


If a circuit has been reserved for a particular user and there is no traffic to send, the bandwidth of that circuit is 
wasted. It cannot be used for other traffic. Packet switching does not waste bandwidth and thus is more efficient 
from a system-wide perspective. Understanding this trade-off is crucial for comprehending the difference 
between circuit switching and packet switching. The trade-off is between guaranteed service and wasting 
resources versus not guaranteeing service and not wasting resources. 
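
The trade-off can be made concrete with a few lines of arithmetic. The sketch below is a deliberately crude model with assumed numbers (a 1536-kbps link, 64-kbps circuits, users active 10 percent of the time); it ignores packet headers and queueing.

```python
def circuits_supported(link_kbps: int, circuit_kbps: int) -> int:
    """Circuit switching: each call reserves its full rate, busy or idle."""
    return link_kbps // circuit_kbps


def wasted_kbps(circuit_kbps: int, active_percent: int) -> float:
    """Average reserved bandwidth that a mostly idle user leaves unused."""
    return circuit_kbps * (100 - active_percent) / 100


def average_users_packet_switched(link_kbps: int, peak_kbps: int,
                                  active_percent: int) -> int:
    """Packet switching: number of users whose *average* load fills the link.

    No guarantee is given: if many of them send at once, packets must wait.
    """
    return (link_kbps * 100) // (peak_kbps * active_percent)


if __name__ == "__main__":
    print(circuits_supported(1536, 64))                  # guaranteed calls
    print(wasted_kbps(64, 10))                           # idle kbps per call
    print(average_users_packet_switched(1536, 64, 10))   # no guarantees
```

With these numbers the link carries 24 guaranteed circuits, each wasting most of its reservation, or the average load of 240 packet-switched users with no guarantee at all.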


Packet switching uses store-and-forward transmission. A packet is accumulated in a router's memory, then sent 
on to the next router. With circuit switching, the bits just flow through the wire continuously. The 
store-and-forward technique adds delay. 


Another difference is that circuit switching is completely transparent. The sender and receiver can use any bit 
rate, format, or framing method they want to. The carrier does not know or care. With packet switching, the 
carrier determines the basic parameters. A rough analogy is a road versus a railroad. In the former, the user 
determines the size, speed, and nature of the vehicle; in the latter, the carrier does. It is this transparency that 
allows voice, data, and fax to coexist within the phone system. 


A final difference between circuit and packet switching is the charging algorithm. With circuit switching, charging 
has historically been based on distance and time. For mobile phones, distance usually does not play a role, 
except for international calls, and time plays only a minor role (e.g., a calling plan with 2000 free minutes costs 
more than one with 1000 free minutes and sometimes night or weekend calls are cheaper than normal). With 
packet switching, connect time is not an issue, but the volume of traffic sometimes is. For home users, ISPs 
usually charge a flat monthly rate because it is less work for them and their customers can understand this 
model easily, but backbone carriers charge regional networks based on the volume of their traffic. The 
differences are summarized in Fig. 2-40. 


Figure 2-40. A comparison of circuit-switched and packet-switched networks. 


Item                                 Circuit switched   Packet switched 
Call setup                           Required           Not needed 
Dedicated physical path              Yes                No 
Each packet follows the same route   Yes                No 
Packets arrive in order              Yes                No 
Is a switch crash fatal              Yes                No 
Bandwidth available                  Fixed              Dynamic 
Time of possible congestion          At setup time      On every packet 
Potentially wasted bandwidth         Yes                No 
Store-and-forward transmission       No                 Yes 
Transparency                         Yes                No 
Charging                             Per minute         Per packet 


Both circuit switching and packet switching are important enough that we will come back to them shortly and 
describe the various technologies used in detail. 


2.4 Cable Television 


We have now studied both the fixed and wireless telephone systems in a fair amount of detail. Both will clearly 
play a major role in future networks. However, an alternative available for fixed networking is now becoming a 
major player: cable television networks. Many people already get their telephone and Internet service over the 
cable, and the cable operators are actively working to increase their market share. In the following sections we 
will look at cable television as a networking system in more detail and contrast it with the telephone systems we 
have just studied. For more information about cable, see (Laubach et al., 2001; Louis, 2002; Ovadia, 2001; and 
Smith, 2002). 


2.4.1 Community Antenna Television 


Cable television was conceived in the late 1940s as a way to provide better reception to people living in rural or 
mountainous areas. The system initially consisted of a big antenna on top of a hill to pluck the television signal 
out of the air, an amplifier, called the headend, to strengthen it, and a coaxial cable to deliver it to people's 
houses, as illustrated in Fig. 2-46. 


Figure 2-46. An early cable television system. 


In the early years, cable television was called Community Antenna Television. It was very much a mom-and-pop 
operation; anyone handy with electronics could set up a service for his town, and the users would chip in to pay 
the costs. As the number of subscribers grew, additional cables were spliced onto the original cable and 
amplifiers were added as needed. Transmission was one way, from the headend to the users. By 1970, 
thousands of independent systems existed. 


In 1974, Time, Inc., started a new channel, Home Box Office, with new content (movies) and distributed only on 
cable. Other cable-only channels followed with news, sports, cooking, and many other topics. This development 
gave rise to two changes in the industry. First, large corporations began buying up existing cable systems and 
laying new cable to acquire new subscribers. Second, there was now a need to connect multiple systems, often 
in distant cities, in order to distribute the new cable channels. The cable companies began to lay cable between 
their cities to connect them all into a single system. This pattern was analogous to what happened in the 
telephone industry 80 years earlier with the connection of previously isolated end offices to make long distance 
calling possible. 


2.4.2 Internet over Cable 


Over the course of the years the cable system grew and the cables between the various cities were replaced by 
high-bandwidth fiber, similar to what was happening in the telephone system. A system with fiber for the 
long-haul runs and coaxial cable to the houses is called an HFC (Hybrid Fiber Coax) system. The electro-optical 
converters that interface between the optical and electrical parts of the system are called fiber nodes. Because 
the bandwidth of fiber is so much more than that of coax, a fiber node can feed multiple coaxial cables. Part of a 
modern HFC system is shown in Fig. 2-47(a). 


Figure 2-47. (a) Cable television. (b) The fixed telephone system. 


In recent years, many cable operators have decided to get into the Internet access business, and often the 
telephony business as well. However, technical differences between the cable plant and telephone plant have an 
effect on what has to be done to achieve these goals. For one thing, all the one-way amplifiers in the system 
have to be replaced by two-way amplifiers. 


However, there is another difference between the HFC system of Fig. 2-47(a) and the telephone system of Fig. 
2-47(b) that is much harder to remove. Down in the neighborhoods, a single cable is shared by many houses, 
whereas in the telephone system, every house has its own private local loop. When used for television 
broadcasting, this sharing does not play a role. All the programs are broadcast on the cable and it does not 
matter whether there are 10 viewers or 10,000 viewers. When the same cable is used for Internet access, it 
matters a lot if there are 10 users or 10,000. If one user decides to download a very large file, that bandwidth is 
potentially being taken away from other users. The more users, the more competition for bandwidth. The 
telephone system does not have this particular property: downloading a large file over an ADSL line does not 
reduce your neighbor's bandwidth. On the other hand, the bandwidth of coax is much higher than that of twisted 
pairs. 


The way the cable industry has tackled this problem is to split up long cables and connect each one directly to a 
fiber node. The bandwidth from the headend to each fiber node is effectively infinite, so as long as there are not 
too many subscribers on each cable segment, the amount of traffic is manageable. Typical cables nowadays 
have 500-2000 houses, but as more and more people subscribe to Internet over cable, the load may become 
too much, requiring more splitting and more fiber nodes. 
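
The effect of sharing, and of splitting a cable, can be seen in a back-of-the-envelope model. The sketch below assumes that the active users on a segment share one downstream channel equally; the 27-Mbps capacity and the user counts are illustrative assumptions.

```python
def per_user_mbps(channel_mbps: float, active_users: int) -> float:
    """Equal share of a segment's channel among the currently active users."""
    if active_users < 1:
        raise ValueError("need at least one active user")
    return channel_mbps / active_users


def segments_needed(houses: int, max_houses_per_segment: int) -> int:
    """Number of cable segments, each connected directly to a fiber node."""
    return -(-houses // max_houses_per_segment)  # ceiling division


if __name__ == "__main__":
    for users in (10, 100, 1000):
        print(users, per_user_mbps(27.0, users))
    # Splitting a hypothetical 2000-house cable into 500-house segments.
    print(segments_needed(2000, 500))
```

Each split reduces the number of users competing on a segment, which is why the cure for a loaded cable is more fiber nodes rather than a faster coax.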


2.4.3 Spectrum Allocation 


Throwing off all the TV channels and using the cable infrastructure strictly for Internet access would probably 
generate a fair number of irate customers, so cable companies are hesitant to do this. Furthermore, most cities 
heavily regulate what is on the cable, so the cable operators would not be allowed to do this even if they really 
wanted to. As a consequence, they needed to find a way to have television and Internet coexist on the same 
cable. 


Cable television channels in North America normally occupy the 54-550 MHz region (except for FM radio from 
88 to 108 MHz). These channels are 6 MHz wide, including guard bands. In Europe the low end is usually 65 
MHz and the channels are 6-8 MHz wide for the higher resolution required by PAL and SECAM but otherwise 
the allocation scheme is similar. The low part of the band is not used. Modern cables can also operate well 
above 550 MHz, often to 750 MHz or more. The solution chosen was to introduce upstream channels in the 
5-42 MHz band (slightly higher in Europe) and use the frequencies at the high end for the downstream. The cable 
spectrum is illustrated in Fig. 2-48. 


Figure 2-48. Frequency allocation in a typical cable TV system used for Internet access. 


Note that since the television signals are all downstream, it is possible to use upstream amplifiers that work only 
in the 5-42 MHz region and downstream amplifiers that work only at 54 MHz and up, as shown in the figure. 
Thus, we get an asymmetry in the upstream and downstream bandwidths because more spectrum is available 
above television than below it. On the other hand, most of the traffic is likely to be downstream, so cable 
operators are not unhappy with this fact of life. As we saw earlier, telephone companies usually offer an 
asymmetric DSL service, even though they have no technical reason for doing so. 


Long coaxial cables are not any better for transmitting digital signals than are long local loops, so analog 
modulation is needed here, too. The usual scheme is to take each 6 MHz or 8 MHz downstream channel and 
modulate it with QAM-64 or, if the cable quality is exceptionally good, QAM-256. With a 6 MHz channel and 
QAM-64, we get about 36 Mbps. When the overhead is subtracted, the net payload is about 27 Mbps. With 
QAM-256, the net payload is about 39 Mbps. The European values are 1/3 larger. 
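
The 36-Mbps figure follows from simple arithmetic. The sketch below assumes roughly one symbol per second per hertz of channel width, which is what the numbers in the text imply; the net payload figures (27 and 39 Mbps) depend on overhead that is not modeled here.

```python
def bits_per_symbol(constellation_points: int) -> int:
    """QAM-64 carries 6 bits per symbol, QAM-256 carries 8."""
    bits = constellation_points.bit_length() - 1
    if constellation_points < 2 or 2 ** bits != constellation_points:
        raise ValueError("constellation size must be a power of two")
    return bits


def raw_rate_mbps(channel_mhz: float, constellation_points: int) -> float:
    """Gross bit rate, assuming about 1 Msymbol/sec per MHz of channel."""
    return channel_mhz * bits_per_symbol(constellation_points)


if __name__ == "__main__":
    print(raw_rate_mbps(6, 64))    # North American channel, QAM-64
    print(raw_rate_mbps(6, 256))   # North American channel, QAM-256
    print(raw_rate_mbps(8, 64))    # European 8-MHz channel, QAM-64
```

An 8-MHz channel is 8/6 the width of a 6-MHz one, which is where the 1/3 increase for the European values comes from.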


For upstream, even QAM-64 does not work well. There is too much noise from terrestrial microwaves, CB 
radios, and other sources, so a more conservative scheme, QPSK, is used. This method (shown in Fig. 2-25) 
yields 2 bits per baud instead of the 6 or 8 bits QAM provides on the downstream channels. Consequently, the 
asymmetry between upstream bandwidth and downstream bandwidth is much more than suggested by 
Fig. 2-48. 
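
The size of this asymmetry can be estimated by multiplying the available spectrum by the bits carried per baud. The sketch below uses the 5-42 MHz upstream band from the text and assumes, for illustration only, that 550-750 MHz is used downstream with QAM-64 and that each megahertz carries about one megabaud.

```python
# Band edges in MHz. The upstream band is from the text; the downstream
# band is an assumption chosen for illustration.
UPSTREAM_BAND = (5, 42)
DOWNSTREAM_BAND = (550, 750)

QPSK_BITS = 2    # bits per baud upstream
QAM64_BITS = 6   # bits per baud downstream


def band_width_mhz(band: tuple) -> int:
    low, high = band
    return high - low


def raw_mbps(band: tuple, bits_per_baud: int) -> int:
    """Gross rate, assuming about 1 Mbaud per MHz of spectrum."""
    return band_width_mhz(band) * bits_per_baud


upstream = raw_mbps(UPSTREAM_BAND, QPSK_BITS)
downstream = raw_mbps(DOWNSTREAM_BAND, QAM64_BITS)
spectrum_ratio = band_width_mhz(DOWNSTREAM_BAND) / band_width_mhz(UPSTREAM_BAND)
rate_ratio = downstream / upstream

if __name__ == "__main__":
    print(upstream, downstream)
    print(round(spectrum_ratio, 1), round(rate_ratio, 1))
```

With these assumed bands the spectrum ratio is about 5.4 to 1, but the bit rate ratio is about 16 to 1, three times larger, because QAM-64 carries three times as many bits per baud as QPSK.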


In addition to upgrading the amplifiers, the operator has to upgrade the headend, too, from a dumb amplifier to 
an intelligent digital computer system with a high-bandwidth fiber interface to an ISP. Often the name gets 
upgraded as well, from "headend" to CMTS (Cable Modem Termination System). In the following text, we will 
refrain from doing a name upgrade and stick with the traditional "headend." 


2.4.4 Cable Modems 


Internet access requires a cable modem, a device that has two interfaces on it: one to the computer and one to 
the cable network. In the early years of cable Internet, each operator had a proprietary cable modem, which was 
installed by a cable company technician. However, it soon became apparent that an open standard would create 
a competitive cable modem market and drive down prices, thus encouraging use of the service. Furthermore, 
having the customers buy cable modems in stores and install them themselves (as they do with V.9x telephone 
modems) would eliminate the dreaded truck rolls. 


Consequently, the larger cable operators teamed up with a company called CableLabs to produce a cable 
modem standard and to test products for compliance. This standard, called DOCSIS (Data Over Cable Service 
Interface Specification), is just starting to replace proprietary modems. The European version is called 
EuroDOCSIS. Not all cable operators like the idea of a standard, however, since many of them were making 
good money leasing their modems to their captive customers. An open standard with dozens of manufacturers 
selling cable modems in stores ends this lucrative practice. 


The modem-to-computer interface is straightforward. It is normally 10-Mbps Ethernet (or occasionally USB) at 
present. In the future, the entire modem might be a small card plugged into the computer, just as with V.9x 
internal modems. 


The other end is more complicated. A large part of the standard deals with radio engineering, a subject that is far 
beyond the scope of this book. The only part worth mentioning here is that cable modems, like ADSL modems, 
are always on. They make a connection when turned on and maintain that connection as long as they are 
powered up because cable operators do not charge for connect time. 


To better understand how they work, let us see what happens when a cable modem is plugged in and powered 
up. The modem scans the downstream channels looking for a special packet periodically put out by the headend 
to provide system parameters to modems that have just come on-line. Upon finding this packet, the new modem 
announces its presence on one of the upstream channels. The headend responds by assigning the modem to its 
upstream and downstream channels. These assignments can be changed later if the headend deems it 
necessary to balance the load. 


The modem then determines its distance from the headend by sending it a special packet and seeing how long it 
takes to get the response. This process is called ranging. It is important for the modem to know its distance to 
accommodate the way the upstream channels operate and to get the timing right. These channels are divided in time into 
minislots. Each upstream packet must fit in one or more consecutive minislots. The headend announces the start 
of a new round of minislots periodically, but the starting gun is not heard at all modems simultaneously due to 
the propagation time down the cable. By knowing how far it is from the headend, each modem can compute how 
long ago the first minislot really started. Minislot length is network dependent. A typical payload is 8 bytes. 
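
Ranging and the minislot count can be sketched as follows. The propagation figure of 5 microseconds per kilometer and the helper names are assumptions for illustration; the 8-byte minislot payload is the typical value mentioned above, and a real modem obtains such parameters from the headend.

```python
USEC_PER_KM = 5.0            # assumed propagation delay in the cable plant
MINISLOT_PAYLOAD_BYTES = 8   # typical value; network dependent


def one_way_delay_usec(round_trip_usec: float,
                       headend_turnaround_usec: float = 0.0) -> float:
    """Half of the measured round trip, less any time the headend spent replying."""
    return (round_trip_usec - headend_turnaround_usec) / 2.0


def distance_km(round_trip_usec: float,
                headend_turnaround_usec: float = 0.0) -> float:
    """Distance to the headend implied by the ranging measurement."""
    return one_way_delay_usec(round_trip_usec, headend_turnaround_usec) / USEC_PER_KM


def minislots_needed(packet_bytes: int,
                     payload_bytes: int = MINISLOT_PAYLOAD_BYTES) -> int:
    """A packet occupies a whole number of consecutive minislots."""
    return -(-packet_bytes // payload_bytes)  # ceiling division


if __name__ == "__main__":
    # Hypothetical measurement: the ranging response comes back after 100 usec.
    print(one_way_delay_usec(100.0), distance_km(100.0))
    print(minislots_needed(64), minislots_needed(65))
```

In this example the modem is about 10 km from the headend, so when it hears the start of a round it knows that the first minislot actually began about 50 microseconds earlier.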


During initialization, the headend also assigns each modem to a minislot to use for requesting upstream 
bandwidth. As a rule, multiple modems will be assigned the same minislot, which leads to contention. When a 
computer wants to send a packet, it transfers the packet to the modem, which then requests the necessary 
number of minislots for it. If the request is accepted, the headend puts an acknowledgement on the downstream 
channel telling the modem which minislots have been reserved for its packet. The packet is then sent, starting in 
the minislot allocated to it. Additional packets can be requested using a field in the header. 


On the other hand, if there is contention for the request minislot, there will be no acknowledgement and the 
modem just waits a random time and tries again. After each successive failure, the randomization time is 
doubled. (For readers already somewhat familiar with networking, this algorithm is just slotted ALOHA with 
binary exponential backoff. Ethernet cannot be used on cable because stations cannot sense the medium. We 
will come back to these issues in Chap. 4.) 
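
The backoff rule can be sketched in a few lines. The initial window of 2 slots and the cap of 1024 slots are assumed values for illustration, not parameters taken from the cable modem standard.

```python
import random

INITIAL_WINDOW = 2    # assumed: slots to choose from after the first failure
MAX_WINDOW = 1024     # assumed cap on the randomization interval


def backoff_window(failures: int) -> int:
    """Size of the randomization interval after `failures` failed requests."""
    if failures < 1:
        raise ValueError("backoff applies only after at least one failure")
    return min(INITIAL_WINDOW * 2 ** (failures - 1), MAX_WINDOW)


def slots_to_wait(failures: int, rng: random.Random) -> int:
    """Pick a random wait, in request opportunities, within the current window."""
    return rng.randrange(backoff_window(failures))


if __name__ == "__main__":
    rng = random.Random(1)  # seeded only to make the demonstration repeatable
    for failures in range(1, 6):
        print(failures, backoff_window(failures), slots_to_wait(failures, rng))
```

Doubling the window after every failure spreads the retries of colliding modems over more and more request opportunities, so repeated collisions become progressively less likely.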


The downstream channels are managed differently from the upstream channels. For one thing, there is only one 
sender (the headend) so there is no contention and no need for minislots, which is actually just time division 
statistical multiplexing. For another, the traffic downstream is usually much larger than upstream, so a fixed 
packet size of 204 bytes is used. Part of that is a Reed-Solomon error-correcting code and some other 
overhead, leaving a user payload of 184 bytes. These numbers were chosen for compatibility with digital 
television using MPEG-2, so the TV and downstream data channels are formatted the same way. Logically, the 
connections are as depicted in Fig. 2-49. 
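
The fixed downstream format makes the overhead easy to compute. The sketch below uses the 204-byte packet and 184-byte payload from the text; the 1500-byte example size is an arbitrary assumption.

```python
PACKET_BYTES = 204    # fixed downstream packet size
PAYLOAD_BYTES = 184   # user payload left after error-correcting code and overhead


def payload_efficiency() -> float:
    """Fraction of each downstream packet that carries user data."""
    return PAYLOAD_BYTES / PACKET_BYTES


def packets_needed(user_bytes: int) -> int:
    """Downstream packets required to carry `user_bytes` of user data."""
    return -(-user_bytes // PAYLOAD_BYTES)  # ceiling division


def bytes_on_cable(user_bytes: int) -> int:
    """Total bytes transmitted, counting whole fixed-size packets."""
    return packets_needed(user_bytes) * PACKET_BYTES


if __name__ == "__main__":
    print(round(payload_efficiency(), 3))
    print(packets_needed(1500), bytes_on_cable(1500))
```

About 90 percent of each packet is payload; a 1500-byte block of user data needs 9 packets, or 1836 bytes on the cable.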


Figure 2-49. Typical details of the upstream and downstream channels in North America. 


Getting back to modem initialization, once the modem has completed ranging and gotten its upstream channel, 
downstream channel, and minislot assignments, it is free to start sending packets. The first packet it sends is 
one to the ISP requesting an IP address, which is dynamically assigned using a protocol called DHCP, which we 
will study in Chap. 5. It also requests and gets an accurate time of day from the headend. 


The next step involves security. Since cable is a shared medium, anybody who wants to go to the trouble to do 
so can read all the traffic going past him. To prevent everyone from snooping on their neighbors (literally), all 
traffic is encrypted in both directions. Part of the initialization procedure involves establishing encryption keys. At 
first one might think that having two strangers, the headend and the modem, establish a secret key in broad 
daylight with thousands of people watching would be impossible. Turns out it is not, but we have to wait until 
Chap. 8 to explain how (the short answer: use the Diffie-Hellman algorithm). 
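
As a preview, the core of the Diffie-Hellman exchange fits in a few lines. The sketch below uses tiny numbers (p = 23, g = 5) and fixed secrets purely for illustration; it is not secure, and it is not meant to describe the key-establishment procedure actually used by cable modems.

```python
def public_value(g: int, secret: int, p: int) -> int:
    """Value sent in the clear: g^secret mod p."""
    return pow(g, secret, p)


def shared_key(their_public: int, my_secret: int, p: int) -> int:
    """Both sides arrive at g^(a*b) mod p without ever sending a or b."""
    return pow(their_public, my_secret, p)


if __name__ == "__main__":
    p, g = 23, 5                          # toy public parameters (assumed)
    modem_secret, headend_secret = 6, 15  # never transmitted
    modem_pub = public_value(g, modem_secret, p)
    headend_pub = public_value(g, headend_secret, p)
    # An eavesdropper sees p, g, modem_pub, and headend_pub, but not the secrets.
    print(shared_key(headend_pub, modem_secret, p) ==
          shared_key(modem_pub, headend_secret, p))
```

Everything that crosses the cable is visible to the neighbors, yet with realistically large numbers recovering the shared key from the public values is believed to be computationally infeasible.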


Finally, the modem has to log in and provide its unique identifier over the secure channel. At this point the 
initialization is complete. The user can now log in to the ISP and get to work. 


There is much more to be said about cable modems. Some relevant references are (Adams and Dulchinos, 
2001; Donaldson and Jones, 2001; and Dutta-Roy, 2001). 


2.4.5 ADSL versus Cable 


Which is better, ADSL or cable? That is like asking which operating system is better. Or which language is 
better. Or which religion. Which answer you get depends on whom you ask. Let us compare ADSL and cable on 
a few points. Both use fiber in the backbone, but they differ on the edge. Cable uses coax; ADSL uses twisted 
pair. The theoretical carrying capacity of coax is hundreds of times more than twisted pair. However, the full 
capacity of the cable is not available for data users because much of the cable's bandwidth is wasted on useless 
stuff such as television programs. 


In practice, it is hard to generalize about effective capacity. ADSL providers give specific statements about the 
bandwidth (e.g., 1 Mbps downstream, 256 kbps upstream) and generally achieve about 80% of it consistently. 
Cable providers do not make any claims because the effective capacity depends on how many people are 
currently active on the user's cable segment. Sometimes it may be better than ADSL and sometimes it may be 
worse. What can be annoying, though, is the unpredictability. Having great service one minute does not 
guarantee great service the next minute since the biggest bandwidth hog in town may have just turned on his 
computer. 


As an ADSL system acquires more users, their increasing numbers have little effect on existing users, since 
each user has a dedicated connection. With cable, as more subscribers sign up for Internet service, performance 
for existing users will drop. The only cure is for the cable operator to split busy cables and connect each one to a 
fiber node directly. Doing so costs time and money, so there are business pressures to avoid it. 


As an aside, we have already studied another system with a shared channel like cable: the mobile telephone 
system. Here, too, a group of users, we could call them cellmates, share a fixed amount of bandwidth. Normally, 
it is rigidly divided in fixed chunks among the active users by FDM and TDM because voice traffic is fairly 
smooth. But for data traffic, this rigid division is very inefficient because data users are frequently idle, in which 
case their reserved bandwidth is wasted. Nevertheless, in this respect, cable access is more like the mobile 
phone system than it is like the fixed system. 


Availability is an issue on which ADSL and cable differ. Everyone has a telephone, but not all users are close 
enough to their end office to get ADSL. On the other hand, not everyone has cable, but if you do have cable and 
the company provides Internet access, you can get it. Distance to the fiber node or headend is not an issue. It is 
also worth noting that since cable started out as a television distribution medium, few businesses have it. 


Being a point-to-point medium, ADSL is inherently more secure than cable. Any cable user can easily read all 
the packets going down the cable. For this reason, any decent cable provider will encrypt all traffic in both 
directions. Nevertheless, having your neighbor get your encrypted messages is still less secure than having him 
not get anything at all. 


The telephone system is generally more reliable than cable. For example, it has backup power and continues to 
work normally even during a power outage. With cable, if the power to any amplifier along the chain fails, all 
downstream users are cut off instantly. 


Finally, most ADSL providers offer a choice of ISPs. Sometimes they are even required to do so by law. This is 
not always the case with cable operators. 


The conclusion is that ADSL and cable are much more alike than they are different. They offer comparable 
service and, as competition between them heats up, probably comparable prices. 


2.5 Summary 


The physical layer is the basis of all networks. Nature imposes two fundamental limits on all channels, and these 
determine their bandwidth. These limits are the Nyquist limit, which deals with noiseless channels, and the 
Shannon limit, which deals with noisy channels. 


Transmission media can be guided or unguided. The principal guided media are twisted pair, coaxial cable, and 
fiber optics. Unguided media include radio, microwaves, infrared, and lasers through the air. An up-and-coming 
transmission system is satellite communication, especially LEO systems. 


A key element in most wide area networks is the telephone system. Its main components are the local loops, 
trunks, and switches. Local loops are analog, twisted pair circuits, which require modems for transmitting digital 
data. ADSL offers speeds up to 50 Mbps by dividing the local loop into many virtual channels and modulating 
each one separately. Wireless local loops are another new development to watch, especially LMDS. 


Trunks are digital, and can be multiplexed in several ways, including FDM, TDM, and WDM. Both circuit 
switching and packet switching are important. 


For mobile applications, the fixed telephone system is not suitable. Mobile phones are currently in widespread 
use for voice and will soon be in widespread use for data. The first generation was analog, dominated by AMPS. 
The second generation was digital, with D-AMPS, GSM, and CDMA the major options. The third generation will 
be digital and based on broadband CDMA. 


An alternative system for network access is the cable television system, which has gradually evolved from a 
community antenna to hybrid fiber coax. Potentially, it offers very high bandwidth, but the actual bandwidth 
available in practice depends heavily on the number of other users currently active and what they are doing. 


Chapter 3. The Data Link Layer 


In this chapter we will study the design principles for layer 2, the data link layer. This study deals with the 
algorithms for achieving reliable, efficient communication between two adjacent machines at the data link layer. 
By adjacent, we mean that the two machines are connected by a communication channel that acts conceptually 
like a wire (e.g., a coaxial cable, telephone line, or point-to-point wireless channel). The essential property of a 
channel that makes it "wirelike" is that the bits are delivered in exactly the same order in which they are sent. 


At first you might think this problem is so trivial that there is no software to study—machine A just puts the bits on 
the wire, and machine B just takes them off. Unfortunately, communication circuits make errors occasionally. 
Furthermore, they have only a finite data rate, and there is a nonzero propagation delay between the time a bit is 
sent and the time it is received. These limitations have important implications for the efficiency of the data 
transfer. The protocols used for communications must take all these factors into consideration. These protocols 
are the subject of this chapter. 


After an introduction to the key design issues present in the data link layer, we will start our study of its protocols 
by looking at the nature of errors, their causes, and how they can be detected and corrected. Then we will study 
a series of increasingly complex protocols, each one solving more and more of the problems present in this 
layer. Finally, we will conclude with an examination of protocol modeling and correctness and give some 
examples of data link protocols. 


3.1 Data Link Layer Design Issues 


The data link layer has a number of specific functions it can carry out. These functions include 


1. Providing a well-defined service interface to the network layer. 
2. Dealing with transmission errors. 
3. Regulating the flow of data so that slow receivers are not swamped by fast senders. 


To accomplish these goals, the data link layer takes the packets it gets from the network layer and encapsulates 
them into frames for transmission. Each frame contains a frame header, a payload field for holding the packet, 
and a frame trailer, as illustrated in Fig. 3-1. Frame management forms the heart of what the data link layer 
does. In the following sections we will examine all the above-mentioned issues in detail. 


Figure 3-1. Relationship between packets and frames. 
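
The encapsulation of Fig. 3-1 can be sketched as follows. The layout used here (a 2-byte length in the header and a 1-byte additive checksum in the trailer) is a hypothetical format invented for illustration; real data link protocols define their own headers and trailers.

```python
import struct

HEADER = struct.Struct("!H")   # hypothetical header: payload length, 2 bytes


def checksum(payload: bytes) -> int:
    """Hypothetical trailer: sum of the payload bytes modulo 256."""
    return sum(payload) % 256


def make_frame(packet: bytes) -> bytes:
    """Wrap a network layer packet in a header and a trailer."""
    if len(packet) > 0xFFFF:
        raise ValueError("packet too long for the 2-byte length field")
    return HEADER.pack(len(packet)) + packet + bytes([checksum(packet)])


def parse_frame(frame: bytes) -> bytes:
    """Check the header and trailer, then hand the packet back."""
    if len(frame) < HEADER.size + 1:
        raise ValueError("frame too short")
    (length,) = HEADER.unpack(frame[:HEADER.size])
    payload = frame[HEADER.size:-1]
    if len(payload) != length or frame[-1] != checksum(payload):
        raise ValueError("damaged frame")
    return payload


if __name__ == "__main__":
    frame = make_frame(b"hello")
    print(len(frame), parse_frame(frame))
```

The network layer sees only the packet that goes in at one end and comes out at the other; the header and trailer exist solely for the benefit of the two data link layers.
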


Although this chapter is explicitly about the data link layer and the data link protocols, many of the principles we 
will study here, such as error control and flow control, are found in transport and other protocols as well. In fact, 
in many networks, these functions are found only in the upper layers and not in the data link layer. However, no 
matter where they are found, the principles are pretty much the same, so it does not really matter where we 
study them. In the data link layer they often show up in their simplest and purest forms, making this a good place 
to examine them in detail. 


3.1.1 Services Provided to the Network Layer 


The function of the data link layer is to provide services to the network layer. The principal service is transferring 
data from the network layer on the source machine to the network layer on the destination machine. On the 
source machine is an entity, call it a process, in the network layer that hands some bits to the data link layer for 
transmission to the destination. The job of the data link layer is to transmit the bits to the destination machine so 
they can be handed over to the network layer there, as shown in Fig. 3-2(a). The actual transmission follows the 
path of Fig. 3-2(b), but it is easier to think in terms of two data link layer processes communicating using a data 
link protocol. For this reason, we will implicitly use the model of Fig. 3-2(a) throughout this chapter. 


Figure 3-2. (a) Virtual communication. (b) Actual communication. 


The data link layer can be designed to offer various services. The actual services offered can vary from system 
to system. Three reasonable possibilities that are commonly provided are 


1. Unacknowledged connectionless service. 
2. Acknowledged connectionless service. 
3. Acknowledged connection-oriented service. 


Let us consider each of these in turn. 


Unacknowledged connectionless service consists of having the source machine send independent frames to the 
destination machine without having the destination machine acknowledge them. No logical connection is 
established beforehand or released afterward. If a frame is lost due to noise on the line, no attempt is made to 
detect the loss or recover from it in the data link layer. This class of service is appropriate when the error rate is 
very low so that recovery is left to higher layers. It is also appropriate for real-time traffic, such as voice, in which 
late data are worse than bad data. Most LANs use unacknowledged connectionless service in the data link layer. 


The next step up in terms of reliability is acknowledged connectionless service. When this service is offered, 
there are still no logical connections used, but each frame sent is individually acknowledged. In this way, the 
sender knows whether a frame has arrived correctly. If it has not arrived within a specified time interval, it can be 
sent again. This service is useful over unreliable channels, such as wireless systems. 


It is perhaps worth emphasizing that providing acknowledgements in the data link layer is just an optimization, 
never a requirement. The network layer can always send a packet and wait for it to be acknowledged. If the 
acknowledgement is not forthcoming before the timer expires, the sender can just send the entire message 
again. The trouble with this strategy is that frames usually have a strict maximum length imposed by the 
hardware and network layer packets do not. If the average packet is broken up into, say, 10 frames, and 20 
percent of all frames are lost, it may take a very long time for the packet to get through. If individual frames are 
acknowledged and retransmitted, entire packets get through much faster. On reliable channels, such as fiber, 
the overhead of a heavyweight data link protocol may be unnecessary, but on wireless channels, with their 
inherent unreliability, it is well worth the cost. 
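
To put rough numbers on this argument, here is a minimal sketch. It assumes frame losses are independent, that acknowledgements are never lost, and that a damaged packet is resent in full until all of its frames arrive in a single attempt. The function names are illustrative only.

```python
def whole_packet_cost(frames, loss):
    """Expected frame transmissions when the entire packet is resent on any loss."""
    p_packet_ok = (1 - loss) ** frames       # every frame survives one attempt
    expected_attempts = 1 / p_packet_ok      # mean of a geometric distribution
    return expected_attempts * frames


def per_frame_cost(frames, loss):
    """Expected frame transmissions when each frame is resent individually."""
    return frames / (1 - loss)


print(round(whole_packet_cost(10, 0.20), 1))  # roughly 93 frame transmissions
print(round(per_frame_cost(10, 0.20), 1))     # 12.5 frame transmissions
```

Under these assumptions a 10-frame packet gets through intact only about 11 percent of the time, so resending whole packets costs roughly seven times as many frame transmissions as acknowledging frames individually.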


Getting back to our services, the most sophisticated service the data link layer can provide to the network layer 
is connection-oriented service. With this service, the source and destination machines establish a connection 
before any data are transferred. Each frame sent over the connection is numbered, and the data link layer 
guarantees that each frame sent is indeed received. Furthermore, it guarantees that each frame is received 
exactly once and that all frames are received in the right order. With connectionless service, in contrast, it is 
conceivable that a lost acknowledgement causes a packet to be sent several times and thus received several 
times. Connection-oriented service, in contrast, provides the network layer processes with the equivalent of a 
reliable bit stream. 


When connection-oriented service is used, transfers go through three distinct phases. In the first phase, the 
connection is established by having both sides initialize variables and counters needed to keep track of which 
frames have been received and which ones have not. In the second phase, one or more frames are actually 
transmitted. In the third and final phase, the connection is released, freeing up the variables, buffers, and other 
resources used to maintain the connection. 


Consider a typical example: a WAN subnet consisting of routers connected by point-to-point leased telephone 
lines. When a frame arrives at a router, the hardware checks it for errors (using techniques we will study later in 
this chapter), then passes the frame to the data link layer software (which might be embedded in a chip on the 
network interface board). The data link layer software checks to see if this is the frame expected, and if so, gives 
the packet contained in the payload field to the routing software. The routing software then chooses the 
appropriate outgoing line and passes the packet back down to the data link layer software, which then transmits 
it. The flow over two routers is shown in Fig. 3-3. 


Figure 3-3. Placement of the data link protocol. 


The routing code frequently wants the job done right, that is, with reliable, sequenced connections on each of the 
point-to-point lines. It does not want to be bothered too often with packets that got lost on the way. It is up to the 
data link protocol, shown in the dotted rectangle, to make unreliable communication lines look perfect or, at 
least, fairly good. As an aside, although we have shown multiple copies of the data link layer software in each 
router, in fact, one copy handles all the lines, with different tables and data structures for each one. 


3.1.2 Framing 


To provide service to the network layer, the data link layer must use the service provided to it by the physical 
layer. What the physical layer does is accept a raw bit stream and attempt to deliver it to the destination. This bit 
stream is not guaranteed to be error free. The number of bits received may be less than, equal to, or more than 
the number of bits transmitted, and they may have different values. It is up to the data link layer to detect and, if 
necessary, correct errors. 


The usual approach is for the data link layer to break the bit stream up into discrete frames and compute the 
checksum for each frame. (Checksum algorithms will be discussed later in this chapter.) When a frame arrives at 
the destination, the checksum is recomputed. If the newly-computed checksum is different from the one 
contained in the frame, the data link layer knows that an error has occurred and takes steps to deal with it (e.g., 
discarding the bad frame and possibly also sending back an error report). 


Breaking the bit stream up into frames is more difficult than it at first appears. One way to achieve this framing is 
to insert time gaps between frames, much like the spaces between words in ordinary text. However, networks 
rarely make any guarantees about timing, so it is possible these gaps might be squeezed out or other gaps 
might be inserted during transmission. 


Since it is too risky to count on timing to mark the start and end of each frame, other methods have been 
devised. In this section we will look at four methods: 


1. Character count. 
2. Flag bytes with byte stuffing. 
3. Starting and ending flags, with bit stuffing. 
4. Physical layer coding violations. 


The first framing method uses a field in the header to specify the number of characters in the frame. When the 
data link layer at the destination sees the character count, it knows how many characters follow and hence 
where the end of the frame is. This technique is shown in Fig. 3-4(a) for four frames of sizes 5, 5, 8, and 8 
characters, respectively. 
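
The following sketch shows how a receiver might split a character stream using the count field. It assumes that the count includes the count character itself, and the payload digits are made up for illustration. It also shows what happens when a single count is damaged in transit: every later boundary shifts.

```python
def split_by_count(stream):
    """Split a list of characters into frames; each frame starts with its own length."""
    frames, i = [], 0
    while i < len(stream):
        count = stream[i]
        if count < 1:                 # a count of 0 would never advance; give up
            break
        frames.append(stream[i:i + count])
        i += count
    return frames


# Four frames of sizes 5, 5, 8, and 8 (illustrative contents).
good = [5, 1, 2, 3, 4,  5, 6, 7, 8, 9,  8, 0, 1, 2, 3, 4, 5, 6,  8, 7, 8, 9, 0, 1, 2, 3]
bad = list(good)
bad[5] = 7                            # the second frame's count is garbled from 5 to 7

print(split_by_count(good))           # four frames, as sent
print(split_by_count(bad))            # the second frame and everything after it are wrong
```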


Figure 3-4. A character stream. (a) Without errors. (b) With one error. 


The trouble with this algorithm is that the count can be garbled by a transmission error. For example, if the 
character count of 5 in the second frame of Fig. 3-4(b) becomes a 7, the destination will get out of 
synchronization and will be unable to locate the start of the next frame. Even if the checksum is incorrect so the 
destination knows that the frame is bad, it still has no way of telling where the next frame starts. Sending a frame 
back to the source asking for a retransmission does not help either, since the destination does not know how 
many characters to skip over to get to the start of the retransmission. For this reason, the character count 
method is rarely used anymore. 


The second framing method gets around the problem of resynchronization after an error by having each frame 
start and end with special bytes. In the past, the starting and ending bytes were different, but in recent years 
most protocols have used the same byte, called a flag byte, as both the starting and ending delimiter, as shown 
in Fig. 3-5(a) as FLAG. In this way, if the receiver ever loses synchronization, it can just search for the flag byte 
to find the end of the current frame. Two consecutive flag bytes indicate the end of one frame and start of the 
next one. 


Figure 3-5. (a) A frame delimited by flag bytes. (b) Four examples of byte sequences before and after 
byte stuffing. 


(a)  FLAG | Header | Payload field | Trailer | FLAG 

(b)  Original characters      After stuffing 
     A FLAG B            ->   A ESC FLAG B 
     A ESC B             ->   A ESC ESC B 
     A ESC FLAG B        ->   A ESC ESC ESC FLAG B 
     A ESC ESC B         ->   A ESC ESC ESC ESC B 


A serious problem occurs with this method when binary data, such as object programs or floating-point numbers, 
are being transmitted. It may easily happen that the flag byte's bit pattern occurs in the data. This situation will 
usually interfere with the framing. One way to solve this problem is to have the sender's data link layer insert a 
special escape byte (ESC) just before each "accidental" flag byte in the data. The data link layer on the receiving 
end removes the escape byte before the data are given to the network layer. This technique is called byte 
stuffing or character stuffing. Thus, a framing flag byte can be distinguished from one in the data by the absence 
or presence of an escape byte before it. 


Of course, the next question is: What happens if an escape byte occurs in the middle of the data? The answer is 
that it, too, is stuffed with an escape byte. Thus, any single escape byte is part of an escape sequence, whereas 
a doubled one indicates that a single escape occurred naturally in the data. Some examples are shown in 
Fig. 3-5(b). In all cases, the byte sequence delivered after destuffing is exactly the same as the original byte sequence. 
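
The stuffing and destuffing rules just described can be sketched as follows. The numeric values chosen for FLAG and ESC are assumptions made for illustration, as are the function names; the text does not fix particular byte values.

```python
FLAG = 0x7E   # assumed value of the flag byte
ESC = 0x7D    # assumed value of the escape byte


def stuff(payload):
    """Frame a payload: escape every FLAG or ESC in the data, then add delimiters."""
    out = bytearray([FLAG])
    for b in payload:
        if b in (FLAG, ESC):
            out.append(ESC)           # stuffed escape byte
        out.append(b)
    out.append(FLAG)
    return bytes(out)


def destuff(frame):
    """Undo stuff(): drop the delimiters and every stuffed escape byte."""
    out = bytearray()
    escaped = False
    for b in frame[1:-1]:
        if b == ESC and not escaped:
            escaped = True            # skip the stuffed ESC, keep the byte after it
            continue
        out.append(b)
        escaped = False
    return bytes(out)


payload = bytes([0x41, ESC, FLAG, 0x42])     # "A ESC FLAG B"
print(stuff(payload).hex(" "))
print(destuff(stuff(payload)) == payload)
```

Because the receiver treats a byte that follows an escape as data, a flag byte seen without a preceding escape can only be a frame delimiter.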


The byte-stuffing scheme depicted in Fig. 3-5 is a slight simplification of the one used in the PPP protocol that 
most home computers use to communicate with their Internet service provider. We will discuss PPP later in this 
chapter. 


A major disadvantage of using this framing method is that it is closely tied to the use of 8-bit characters. Not all 
character codes use 8-bit characters. For example, UNICODE uses 16-bit characters. As networks developed, 
the disadvantages of embedding the character code length in the framing mechanism became more and more 
obvious, so a new technique had to be developed to allow arbitrary-sized characters. 


The new technique allows data frames to contain an arbitrary number of bits and allows character codes with an 
arbitrary number of bits per character. It works like this. Each frame begins and ends with a special bit pattern, 
01111110 (in fact, a flag byte). Whenever the sender's data link layer encounters five consecutive 1s in the data, 
it automatically stuffs a 0 bit into the outgoing bit stream. This bit stuffing is analogous to byte stuffing, in which 
an escape byte is stuffed into the outgoing character stream before a flag byte in the data. 


When the receiver sees five consecutive incoming 1 bits, followed by a 0 bit, it automatically destuffs (i.e., 
deletes) the 0 bit. Just as byte stuffing is completely transparent to the network layer in both computers, so is bit 
stuffing. If the user data contain the flag pattern, 01111110, this flag is transmitted as 011111010 but stored in 
the receiver's memory as 01111110. Figure 3-6 gives an example of bit stuffing. 
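
A minimal sketch of bit stuffing on strings of '0' and '1' characters is given below. It handles only the data between the flags and assumes the input to the destuffing routine was produced by the stuffing routine, so the bit after five consecutive 1s is always a stuffed 0. The function names are illustrative.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of five consecutive 1 bits."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == "1" else 0
        if run == 5:
            out.append("0")           # stuffed bit
            run = 0
    return "".join(out)


def bit_destuff(bits):
    """Delete the 0 that follows every run of five consecutive 1 bits."""
    out, run, skip = [], 0, False
    for b in bits:
        if skip:                      # this is the stuffed 0; drop it
            skip, run = False, 0
            continue
        out.append(b)
        run = run + 1 if b == "1" else 0
        if run == 5:
            skip = True
    return "".join(out)


print(bit_stuff("01111110"))          # 011111010
```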


Figure 3-6. Bit stuffing. (a) The original data. (b) The data as they appear on the line. (c) The data as they 
are stored in the receiver's memory after destuffing. 


(a) 011011111111111111110010 

(b) 011011111011111011111010010 
             ^     ^     ^ 
             Stuffed bits 

(c) 011011111111111111110010 


With bit stuffing, the boundary between two frames can be unambiguously recognized by the flag pattern. Thus, 
if the receiver loses track of where it is, all it has to do is scan the input for flag sequences, since they can only 
occur at frame boundaries and never within the data. 


The last method of framing is only applicable to networks in which the encoding on the physical medium 
contains some redundancy. For example, some LANs encode 1 bit of data by using 2 physical bits. Normally, a 1 
bit is a high-low pair and a 0 bit is a low-high pair. The scheme means that every data bit has a transition in the 
middle, making it easy for the receiver to locate the bit boundaries. The combinations high-high and low-low are 
not used for data but are used for delimiting frames in some protocols. 


As a final note on framing, many data link protocols use a combination of a character count with one of the other 
methods for extra safety. When a frame arrives, the count field is used to locate the end of the frame. Only if the 
appropriate delimiter is present at that position and the checksum is correct is the frame accepted as valid. 
Otherwise, the input stream is scanned for the next delimiter. 


3.1.3 Error Control 


Having solved the problem of marking the start and end of each frame, we come to the next problem: how to 
make sure all frames are eventually delivered to the network layer at the destination and in the proper order. 
Suppose that the sender just kept outputting frames without regard to whether they were arriving properly. This 
might be fine for unacknowledged connectionless service, but would most certainly not be fine for reliable, 
connection-oriented service. 


The usual way to ensure reliable delivery is to provide the sender with some feedback about what is happening 
at the other end of the line. Typically, the protocol calls for the receiver to send back special control frames 
bearing positive or negative acknowledgements about the incoming frames. If the sender receives a positive 
acknowledgement about a frame, it knows the frame has arrived safely. On the other hand, a negative 
acknowledgement means that something has gone wrong, and the frame must be transmitted again. 


An additional complication comes from the possibility that hardware troubles may cause a frame to vanish 
completely (e.g., in a noise burst). In this case, the receiver will not react at all, since it has no reason to react. It 
should be clear that a protocol in which the sender transmits a frame and then waits for an acknowledgement, 
positive or negative, will hang forever if a frame is ever lost due to, for example, malfunctioning hardware. 


This possibility is dealt with by introducing timers into the data link layer. When the sender transmits a frame, it 
generally also starts a timer. The timer is set to expire after an interval long enough for the frame to reach the 
destination, be processed there, and have the acknowledgement propagate back to the sender. Normally, the 
frame will be correctly received and the acknowledgement will get back before the timer runs out, in which case 
the timer will be canceled. 


However, if either the frame or the acknowledgement is lost, the timer will go off, alerting the sender to a 
potential problem. The obvious solution is to just transmit the frame again. However, when frames may be 
transmitted multiple times there is a danger that the receiver will accept the same frame two or more times and 
pass it to the network layer more than once. To prevent this from happening, it is generally necessary to assign 
sequence numbers to outgoing frames, so that the receiver can distinguish retransmissions from originals. 
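
As a small illustration of how sequence numbers let the receiver filter out retransmissions, consider the sketch below. It assumes the simplest possible case, a 1-bit sequence number with one frame outstanding at a time; the function name and the list-based interface are hypothetical.

```python
def receive(arrivals):
    """Process (seq, payload) pairs in arrival order.

    Returns the payloads passed to the network layer and the acknowledgements sent.
    """
    expected = 0
    delivered, acks = [], []
    for seq, payload in arrivals:
        if seq == expected:
            delivered.append(payload)   # new frame: hand it up exactly once
            expected ^= 1               # toggle the 1-bit sequence number
        acks.append(seq)                # acknowledge duplicates too, so the sender moves on
    return delivered, acks


# Frame A is retransmitted because its acknowledgement was lost.
print(receive([(0, "A"), (0, "A"), (1, "B"), (0, "C")]))
```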


The whole issue of managing the timers and sequence numbers so as to ensure that each frame is ultimately 
passed to the network layer at the destination exactly once, no more and no less, is an important part of the data 
link layer's duties. Later in this chapter, we will look at a series of increasingly sophisticated examples to see 
how this management is done. 


3.1.4 Flow Control 


Another important design issue that occurs in the data link layer (and higher layers as well) is what to do with a 
sender that systematically wants to transmit frames faster than the receiver can accept them. This situation can 
easily occur when the sender is running on a fast (or lightly loaded) computer and the receiver is running on a 
slow (or heavily loaded) machine. The sender keeps pumping the frames out at a high rate until the receiver is 
completely swamped. Even if the transmission is error free, at a certain point the receiver will simply be unable 
to handle the frames as they arrive and will start to lose some. Clearly, something has to be done to prevent this 
situation. 


Two approaches are commonly used. In the first one, feedback-based flow control, the receiver sends back 
information to the sender giving it permission to send more data or at least telling the sender how the receiver is 
doing. In the second one, rate-based flow control, the protocol has a built-in mechanism that limits the rate at 
which senders may transmit data, without using feedback from the receiver. In this chapter we will study 
feedback-based flow control schemes because rate-based schemes are never used in the data link layer. We 
will look at rate-based schemes in Chap. 5. 


Various feedback-based flow control schemes are known, but most of them use the same basic principle. The 
protocol contains well-defined rules about when a sender may transmit the next frame. These rules often prohibit 
frames from being sent until the receiver has granted permission, either implicitly or explicitly. For example, 
when a connection is set up, the receiver might say: "You may send me n frames now, but after they have been 
sent, do not send any more until I have told you to continue." We will examine the details shortly. 


3.2 Error Detection and Correction 


As we saw in Chap. 2, the telephone system has three parts: the switches, the interoffice trunks, and the local 
loops. The first two are now almost entirely digital in most developed countries. The local loops are still analog 
twisted copper pairs and will continue to be so for years due to the enormous expense of replacing them. While 
errors are rare on the digital part, they are still common on the local loops. Furthermore, wireless communication 
is becoming more common, and the error rates here are orders of magnitude worse than on the interoffice fiber 
trunks. The conclusion is: transmission errors are going to be with us for many years to come. We have to learn 
how to deal with them. 


As a result of the physical processes that generate them, errors on some media (e.g., radio) tend to come in 
bursts rather than singly. Having the errors come in bursts has both advantages and disadvantages over 
isolated single-bit errors. On the advantage side, computer data are always sent in blocks of bits. Suppose that 
the block size is 1000 bits and the error rate is 0.001 per bit. If errors were independent, most blocks would 
contain an error. If the errors came in bursts of 100 however, only one or two blocks in 100 would be affected, on 
average. The disadvantage of burst errors is that they are much harder to correct than are isolated errors. 
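
The figures in this example can be checked with a short calculation. The sketch assumes independent bit errors for the first case; the function name is illustrative.

```python
def p_block_hit(block_bits, bit_error_rate):
    """Probability that a block contains at least one error, for independent bit errors."""
    return 1 - (1 - bit_error_rate) ** block_bits


# Independent errors: about 63 percent of all 1000-bit blocks are damaged.
print(round(p_block_hit(1000, 0.001), 3))

# Bursts of 100: 100 blocks carry 100,000 bits and hence about 100 bad bits,
# that is, about one burst, which damages one block (or two if it straddles a boundary).
expected_bad_bits = 100 * 1000 * 0.001
print(expected_bad_bits / 100)
```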


3.2.1 Error-Correcting Codes 


Network designers have developed two basic strategies for dealing with errors. One way is to include enough 
redundant information along with each block of data sent, to enable the receiver to deduce what the transmitted 
data must have been. The other way is to include only enough redundancy to allow the receiver to deduce that 
an error occurred, but not which error, and have it request a retransmission. The former strategy uses 
error-correcting codes and the latter uses error-detecting codes. The use of error-correcting codes is often referred to 
as forward error correction. 


Each of these techniques occupies a different ecological niche. On channels that are highly reliable, such as 
fiber, it is cheaper to use an error detecting code and just retransmit the occasional block found to be faulty. 
However, on channels such as wireless links that make many errors, it is better to add enough redundancy to 
each block for the receiver to be able to figure out what the original block was, rather than relying on a 
retransmission, which itself may be in error. 


To understand how errors can be handled, it is necessary to look closely at what an error really is. Normally, a 
frame consists of m data (i.e., message) bits and r redundant, or check, bits. Let the total length be n (i.e., 
n = m + r). An n-bit unit containing data and check bits is often referred to as an n-bit codeword. 

Given any two codewords, say, 10001001 and 10110001, it is possible to determine how many corresponding 
bits differ. In this case, 3 bits differ. To determine how many bits differ, just exclusive OR the two codewords and 
count the number of 1 bits in the result, for example: 


10001001 
10110001 
00111000 


The number of bit positions in which two codewords differ is called the Hamming distance (Hamming, 1950). Its 
significance is that if two codewords are a Hamming distance d apart, it will require d single-bit errors to convert 
one into the other. 
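
In code, the same computation is an exclusive OR followed by a count of the 1 bits. This is a minimal sketch; the function name is illustrative.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(hamming_distance("10001001", "10110001"))  # 3
```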


In most data transmission applications, all 2^m possible data messages are legal, but due to the way the check 
bits are computed, not all of the 2^n possible codewords are used. Given the algorithm for computing the check 
bits, it is possible to construct a complete list of the legal codewords, and from this list find the two codewords 
whose Hamming distance is minimum. This distance is the Hamming distance of the complete code. 


The error-detecting and error-correcting properties of a code depend on its Hamming distance. To detect d 
errors, you need a distance d + 1 code because with such a code there is no way that d single-bit errors can 
change a valid codeword into another valid codeword. When the receiver sees an invalid codeword, it can tell 
that a transmission error has occurred. Similarly, to correct d errors, you need a distance 2d + 1 code because 
that way the legal codewords are so far apart that even with d changes, the original codeword is still closer than 
any other codeword, so it can be uniquely determined. 


As a simple example of an error-detecting code, consider a code in which a single parity bit is appended to the 
data. The parity bit is chosen so that the number of 1 bits in the codeword is even (or odd). For example, when 
1011010 is sent in even parity, a bit is added to the end to make it 10110100. With odd parity 1011010 becomes 
10110101. A code with a single parity bit has a distance 2, since any single-bit error produces a codeword with 
the wrong parity. It can be used to detect single errors. 
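
A minimal sketch of the parity computation follows, using the values from the example above. The function names are illustrative.

```python
def add_parity(bits, even=True):
    """Append one parity bit so the total number of 1 bits is even (or odd)."""
    ones = bits.count("1")
    parity = ones % 2 if even else 1 - ones % 2
    return bits + str(parity)


def parity_ok(codeword, even=True):
    """True if the codeword has the agreed parity."""
    return codeword.count("1") % 2 == (0 if even else 1)


print(add_parity("1011010"))              # 10110100
print(add_parity("1011010", even=False))  # 10110101
print(parity_ok("10110110"))              # False: one bit was inverted in transit
```

Note that two inverted bits restore the parity, which is why a distance 2 code detects single errors only.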


As a simple example of an error-correcting code, consider a code with only four valid codewords: 
0000000000, 0000011111, 1111100000, and 1111111111 


This code has a distance 5, which means that it can correct double errors. If the codeword 0000000111 arrives, 
the receiver knows that the original must have been 0000011111. If, however, a triple error changes 
0000000000 into 0000000111, the error will not be corrected properly. 
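
The behavior of this four-codeword code can be reproduced with a nearest-codeword decoder, sketched below. The function names are illustrative.

```python
CODEWORDS = ["0000000000", "0000011111", "1111100000", "1111111111"]


def distance(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum distance over all pairs of distinct codewords."""
    return min(distance(a, b)
               for i, a in enumerate(codewords)
               for b in codewords[i + 1:])


def decode(received):
    """Pick the legal codeword closest to what arrived."""
    return min(CODEWORDS, key=lambda c: distance(c, received))


print(code_distance(CODEWORDS))   # 5
print(decode("0000000111"))       # 0000011111
```

The decoder cannot tell a double error in 0000011111 from a triple error in 0000000000: both arrive as 0000000111, and it always chooses the closer codeword.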


Imagine that we want to design a code with m message bits and r check bits that will allow all single errors to be 
corrected. Each of the 2^m legal messages has n illegal codewords at a distance 1 from it. These are formed by 
systematically inverting each of the n bits in the n-bit codeword formed from it. Thus, each of the 2^m legal 
messages requires n + 1 bit patterns dedicated to it. Since the total number of bit patterns is 2^n, we must have 
(n + 1)2^m ≤ 2^n. Using n = m + r, this requirement becomes (m + r + 1) ≤ 2^r. Given m, this puts a lower limit on the 
number of check bits needed to correct single errors. 


This theoretical lower limit can, in fact, be achieved using a method due to Hamming (1950). The bits of the 
codeword are numbered consecutively, starting with bit 1 at the left end, bit 2 to its immediate right, and so on. 
The bits that are powers of 2 (1, 2, 4, 8, 16, etc.) are check bits. The rest (3, 5, 6, 7, 9, etc.) are filled up with the 
m data bits. Each check bit forces the parity of some collection of bits, including itself, to be even (or odd). A bit 
may be included in several parity computations. To see which check bits the data bit in position k contributes to, 
rewrite k as a sum of powers of 2. For example, 11 = 1 + 2 + 8 and 29 = 1 + 4 + 8 + 16. A bit is checked by just 
those check bits occurring in its expansion (e.g., bit 11 is checked by bits 1, 2, and 8). 


When a codeword arrives, the receiver initializes a counter to zero. It then examines each check bit, k (k = 1, 2, 
4, 8, ...), to see if it has the correct parity. If not, the receiver adds k to the counter. If the counter is zero after all 
the check bits have been examined (i.e., if they were all correct), the codeword is accepted as valid. If the 
counter is nonzero, it contains the number of the incorrect bit. For example, if check bits 1, 2, and 8 are in error, 
the inverted bit is 11, because it is the only one checked by bits 1, 2, and 8. Figure 3-7 shows some 7-bit ASCII 
characters encoded as 11-bit codewords using a Hamming code. Remember that the data are found in bit 
positions 3, 5, 6, 7, 9, 10, and 11. 
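
The encoding and correction procedure can be sketched as follows. The sketch assumes even parity and represents codewords as strings of '0' and '1'; the function names are illustrative. The helper check_bits_needed finds the smallest r for which m + r + 1 does not exceed 2^r.

```python
def check_bits_needed(m):
    """Smallest r such that m + r + 1 <= 2**r."""
    r = 0
    while m + r + 1 > 2 ** r:
        r += 1
    return r


def _parity(bits, p):
    """XOR of all bits whose position number includes the power of two p."""
    parity = 0
    for pos in range(1, len(bits)):
        if pos & p:
            parity ^= bits[pos]
    return parity


def hamming_encode(data):
    """Put data bits in the non-power-of-2 positions, then fill in the check bits."""
    n = len(data) + check_bits_needed(len(data))
    bits = [0] * (n + 1)                  # bits[0] is unused so positions start at 1
    source = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # nonzero unless pos is a power of 2
            bits[pos] = int(next(source))
    p = 1
    while p <= n:
        bits[p] = _parity(bits, p)        # the check bit is still 0 here
        p *= 2
    return "".join(map(str, bits[1:]))


def hamming_correct(codeword):
    """Return (corrected codeword, number of the inverted bit, or 0 if none)."""
    bits = [0] + [int(c) for c in codeword]
    counter, p = 0, 1
    while p < len(bits):
        if _parity(bits, p):              # wrong parity: add k to the counter
            counter += p
        p *= 2
    if 0 < counter < len(bits):
        bits[counter] ^= 1                # invert the bit the counter points at
    return "".join(map(str, bits[1:])), counter


print(hamming_encode("1001000"))          # 7-bit ASCII 'H'
print(hamming_correct("10111001000"))     # codeword for 'a' with bit 11 inverted
```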


Figure 3-7. Use of a Hamming code to correct burst errors. 


Char.    ASCII      Codeword (check bits in positions 1, 2, 4, and 8) 
H        1001000    00110010000 
a        1100001    10111001001 
m        1101101    11101010101 
m        1101101    11101010101 
i        1101001    01101011001 
n        1101110    01101010110 
g        1100111    01111001111 
(space)  0100000    10011000000 
c        1100011    11111000011 
o        1101111    10101011111 
d        1100100    11111001100 
e        1100101    00111000101 

Order of bit transmission: column by column, starting with the leftmost column. 


Hamming codes can only correct single errors. However, there is a trick that can be used to permit Hamming 
codes to correct burst errors. A sequence of k consecutive codewords is arranged as a matrix, one codeword 
per row. Normally, the data would be transmitted one codeword at a time, from left to right. To correct burst 
errors, the data should be transmitted one column at a time, starting with the leftmost column. When all k bits 
have been sent, the second column is sent, and so on, as indicated in Fig. 3-7. When the frame arrives at the 
receiver, the matrix is reconstructed, one column at a time. If a burst error of length k occurs, at most 1 bit in 
each of the k codewords will have been affected, but the Hamming code can correct one error per codeword, so 
the entire block can be restored. This method uses kr check bits to make blocks of km data bits immune to a 
single burst error of length k or less. 
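
The column-by-column transmission order can be sketched as below. The example inverts every bit in a burst of length k, the worst case, and shows that each codeword ends up with exactly one bad bit, which a Hamming code can then repair. The function names are illustrative.

```python
def interleave(codewords):
    """Send a matrix with one codeword per row, one column at a time."""
    return "".join("".join(column) for column in zip(*codewords))


def deinterleave(stream, k):
    """Rebuild the k codewords from a column-by-column bit stream."""
    return [stream[i::k] for i in range(k)]


def invert_burst(stream, start, length):
    """Invert every bit in stream[start:start + length]."""
    burst = "".join("1" if b == "0" else "0" for b in stream[start:start + length])
    return stream[:start] + burst + stream[start + length:]


rows = ["00110010000", "10111001001", "11101010101"]   # codewords for 'H', 'a', 'm'
sent = interleave(rows)
received = deinterleave(invert_burst(sent, 4, len(rows)), len(rows))
for original, damaged in zip(rows, received):
    print(sum(x != y for x, y in zip(original, damaged)))   # one bad bit per codeword
```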


3.2.2 Error-Detecting Codes 


Error-correcting codes are widely used on wireless links, which are notoriously noisy and error prone when 
compared to copper wire or optical fibers. Without error-correcting codes, it would be hard to get anything 
through. However, over copper wire or fiber, the error rate is much lower, so error detection and retransmission 
is usually more efficient there for dealing with the occasional error. 


As a simple example, consider a channel on which errors are isolated and the error rate is 10^-6 per bit. Let the 
block size be 1000 bits. To provide error correction for 1000-bit blocks, 10 check bits are needed; a megabit of 
data would require 10,000 check bits. To merely detect a block with a single 1-bit error, one parity bit per block 
will suffice. Once every 1000 blocks, an extra block (1001 bits) will have to be transmitted. The total overhead for 
the error detection + retransmission method is only 2001 bits per megabit of data, versus 10,000 bits for a 
Hamming code. 
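
The arithmetic behind this comparison is spelled out in the sketch below. It assumes, as the example does, that each damaged block is retransmitted once and arrives intact the second time. The function names are illustrative.

```python
def correction_overhead(data_bits, block_bits, check_bits_per_block):
    """Check bits added when every block carries an error-correcting code."""
    return (data_bits // block_bits) * check_bits_per_block


def detection_overhead(data_bits, block_bits, bit_error_rate):
    """One parity bit per block plus the expected cost of retransmitted blocks."""
    parity_bits = data_bits // block_bits
    expected_bad_blocks = data_bits * bit_error_rate     # errors are isolated
    return parity_bits + expected_bad_blocks * (block_bits + 1)


print(correction_overhead(1_000_000, 1000, 10))          # 10000
print(round(detection_overhead(1_000_000, 1000, 1e-6)))  # 2001
```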


If a single parity bit is added to a block and the block is badly garbled by a long burst error, the probability that 
the error will be detected is only 0.5, which is hardly acceptable. The odds can be improved considerably if each 
block to be sent is regarded as a rectangular matrix n bits wide and k bits high, as described above. A parity bit 
is computed separately for each column and affixed to the matrix as the last row. The matrix is then transmitted 
one row at a time. When the block arrives, the receiver checks all the parity bits. If any one of them is wrong, the 
receiver requests a retransmission of the block. Additional retransmissions are requested as needed until an 
entire block is received without any parity errors. 


This method can detect a single burst of length n, since only 1 bit per column will be changed. A burst of length 
n + 1 will pass undetected, however, if the first bit is inverted, the last bit is inverted, and all the other bits are 
correct. (A burst error does not imply that all the bits are wrong; it just implies that at least the first and last are 
wrong.) If the block is badly garbled by a long burst or by multiple shorter bursts, the probability that any of the n 
columns will have the correct parity, by accident, is 0.5, so the probability of a bad block being accepted when it 
should not be is 2^-n. 
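
The column parity scheme, and the burst of length n + 1 that slips past it, can be sketched as follows. Even parity is assumed, the 3 x 4 block is made up for illustration, and the function names are illustrative.

```python
def add_column_parity(rows):
    """Append a last row holding the even parity of each column."""
    width = len(rows[0])
    parity = "".join(str(sum(int(r[c]) for r in rows) % 2) for c in range(width))
    return rows + [parity]


def block_ok(block):
    """True if every column, parity row included, has even parity."""
    width = len(block[0])
    return all(sum(int(r[c]) for r in block) % 2 == 0 for c in range(width))


def invert(block, positions):
    """Invert the bits at the given positions, counted in row-by-row transmission order."""
    width = len(block[0])
    bits = list("".join(block))
    for p in positions:
        bits[p] = "1" if bits[p] == "0" else "0"
    joined = "".join(bits)
    return [joined[i:i + width] for i in range(0, len(joined), width)]


block = add_column_parity(["1011", "0010", "1100"])   # n = 4 bits wide
print(block_ok(block))                                # True
print(block_ok(invert(block, [2, 3, 4, 5])))          # False: burst of length n is caught
print(block_ok(invert(block, [1, 5])))                # True: burst of length n + 1 is missed
```

In the last case the first and last bits of the burst fall in the same column, so that column's parity is unchanged.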


Although the above scheme may sometimes be adequate, in practice, another method is in widespread use: the 
polynomial code, also known as a CRC (Cyclic Redundancy Check). Polynomial codes are based upon treating 
bit strings as representations of polynomials with coefficients of 0 and 1 only. A k-bit frame is regarded as the 
coefficient list for a polynomial with k terms, ranging from x^(k-1) to x^0. Such a polynomial is said to be of degree 
k - 1. The high-order (leftmost) bit is the coefficient of x^(k-1); the next bit is the coefficient of x^(k-2), and so on. For 
example, 110001 has 6 bits and thus represents a six-term polynomial with coefficients 1, 1, 0, 0, 0, and 1: 
x^5 + x^4 + x^0. 


Polynomial arithmetic is done modulo 2, according to the rules of algebraic field theory. There are no carries for 
addition or borrows for subtraction. Both addition and subtraction are identical to exclusive OR. For example: 


  10011011     00110011     11110000     01010101
+ 11001010   + 11001101   - 10100110   - 10101111
  --------     --------     --------     --------
  01010001     11111110     01010110     11111010
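In C, modulo-2 addition and subtraction of bit strings both reduce to the bitwise exclusive OR operator. The sketch below assumes 8-bit strings; the function names mod2_add and mod2_sub are hypothetical and deliberately have identical bodies.

```c
#include <stdint.h>

/* Modulo-2 addition of two 8-bit strings: no carries, so it is plain XOR. */
uint8_t mod2_add(uint8_t a, uint8_t b) { return (uint8_t)(a ^ b); }

/* Modulo-2 subtraction: no borrows, so it is the same XOR. */
uint8_t mod2_sub(uint8_t a, uint8_t b) { return (uint8_t)(a ^ b); }
```

For instance, mod2_add(0x9B, 0xCA) corresponds to the first example above (10011011 + 11001010) and yields 0x51, that is, 01010001.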


Long division is carried out the same way as it is in binary except that the subtraction is done modulo 2, as 
above. A divisor is said "to go into" a dividend if the dividend has as many bits as the divisor. 


When the polynomial code method is employed, the sender and receiver must agree upon a generator 
polynomial, G(x), in advance. Both the high- and low-order bits of the generator must be 1. To compute the 
checksum for some frame with m bits, corresponding to the polynomial M(x), the frame must be longer than the 
generator polynomial. The idea is to append a checksum to the end of the frame in such a way that the 
polynomial represented by the checksummed frame is divisible by G(x). When the receiver gets the 
checksummed frame, it tries dividing it by G(x). If there is a remainder, there has been a transmission error. 


The algorithm for computing the checksum is as follows: 


1. Let r be the degree of G(x). Append r zero bits to the low-order end of the frame so it now contains m + r
bits and corresponds to the polynomial x^r M(x).

2. Divide the bit string corresponding to G(x) into the bit string corresponding to x^r M(x), using modulo 2
division.

3. Subtract the remainder (which is always r or fewer bits) from the bit string corresponding to x^r M(x) using
modulo 2 subtraction. The result is the checksummed frame to be transmitted. Call its polynomial T(x).


Figure 3-8 illustrates the calculation for a frame 1101011011 using the generator G(x) = x^4 + x + 1.

Figure 3-8. Calculation of the polynomial code checksum.


Frame:     1101011011
Generator: 10011
Message after 4 zero bits are appended: 11010110110000


                  1100001010
        10011 ) 11010110110000
                10011
                 10011
                 10011
                  00001
                  00000
                   00010
                   00000
                    00101
                    00000
                     01011
                     00000
                      10110
                      10011
                       01010
                       00000
                        10100
                        10011
                         01110
                         00000
                          1110   <- Remainder


Transmitted frame: 11010110111110
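The three steps of the checksum algorithm can be sketched in C as shown below. This is a minimal illustration, not production code: it assumes the frame plus checksum fits in 32 bits, holds each bit string in an integer with the high-order bit leftmost, and uses the hypothetical names degree, mod2_remainder, and crc_encode.

```c
#include <stdint.h>

/* Degree of a nonzero polynomial: the index of its highest 1 bit. */
static int degree(uint32_t p)
{
    int d = 0;
    while (d < 31 && (p >> (d + 1)) != 0)
        d++;
    return d;
}

/* Remainder of value divided by gen, using modulo-2 long division. */
uint32_t mod2_remainder(uint32_t value, uint32_t gen)
{
    int r = degree(gen);
    for (int bit = 31; bit >= r; bit--)
        if (value & ((uint32_t)1 << bit))
            value ^= gen << (bit - r);   /* "subtract" the aligned divisor */
    return value;
}

/* Build the checksummed frame T(x) from the frame M(x) and generator G(x). */
uint32_t crc_encode(uint32_t frame, uint32_t gen)
{
    int r = degree(gen);
    uint32_t shifted = frame << r;                /* step 1: x^r M(x)       */
    uint32_t rem = mod2_remainder(shifted, gen);  /* step 2: divide by G(x) */
    return shifted ^ rem;                         /* step 3: subtract       */
}
```

With the values of Fig. 3-8, the frame 1101011011 is 0x35B and the generator 10011 is 0x13. The remainder comes out as 0xE (1110) and crc_encode returns 0x35BE, which is 11010110111110.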


It should be clear that T(x) is divisible (modulo 2) by G(x). In any division problem, if you diminish the dividend by 
the remainder, what is left over is divisible by the divisor. For example, in base 10, if you divide 210,278 by 
10,941, the remainder is 2399. By subtracting 2399 from 210,278, what is left over (207,879) is divisible by 
10,941. 


Now let us analyze the power of this method. What kinds of errors will be detected? Imagine that a transmission 
error occurs, so that instead of the bit string for T(x) arriving, T(x) + E(x) arrives. Each 1 bit in E(x) corresponds 
to a bit that has been inverted. If there are k 1 bits in E(x), k single-bit errors have occurred. A single burst error 
is characterized by an initial 1, a mixture of 0s and 1s, and a final 1, with all other bits being 0.


Upon receiving the checksummed frame, the receiver divides it by G(x); that is, it computes [T(x) + E(x)]/G(x).
T(x)/G(x) leaves a remainder of 0, so the result of the computation is simply the remainder of E(x)/G(x). Those
errors that happen to correspond to
polynomials containing G(x) as a factor will slip by; all other errors will be caught. 


If there has been a single-bit error, E(x) = x^i, where i determines which bit is in error. If G(x) contains two or
more terms, it will never divide E(x), so all single-bit errors will be detected.
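The receiver's check, and the claim about single-bit errors, can be tried out with a short C sketch. It assumes frames of at most 32 bits held in an integer; the names mod2_remainder, frame_ok, and undetected_single_bit_errors are hypothetical.

```c
#include <stdint.h>

/* Remainder of value divided by gen (nonzero), using modulo-2 division. */
uint32_t mod2_remainder(uint32_t value, uint32_t gen)
{
    int r = 0;
    while (r < 31 && (gen >> (r + 1)) != 0)
        r++;                                  /* r = degree of gen */
    for (int bit = 31; bit >= r; bit--)
        if (value & ((uint32_t)1 << bit))
            value ^= gen << (bit - r);
    return value;
}

/* Receiver side: a frame is accepted only if G(x) divides it exactly. */
int frame_ok(uint32_t received, uint32_t gen)
{
    return mod2_remainder(received, gen) == 0;
}

/* Invert each of the nbits bits of t in turn, i.e., add E(x) = x^i, and
   count how many of the damaged frames the receiver would still accept. */
int undetected_single_bit_errors(uint32_t t, int nbits, uint32_t gen)
{
    int missed = 0;
    for (int i = 0; i < nbits; i++)
        if (frame_ok(t ^ ((uint32_t)1 << i), gen))
            missed++;
    return missed;
}
```

For the 14-bit transmitted frame 11010110111110 (0x35BE) and the generator 10011 (0x13), the count is zero: every single-bit error leaves a nonzero remainder.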


If there have been two isolated single-bit errors, E(x) = x^i + x^j, where i > j. Alternatively, this can be written
as E(x) = x^j(x^{i-j} + 1). If we assume that G(x) is not divisible by x, a sufficient condition for all double errors
to be detected is that G(x) does not divide x^k + 1 for any k up to the maximum value of i - j (i.e., up to the
maximum frame length). Simple, low-degree polynomials that give protection to long frames are known. For example,
x^15 + x^14 + 1 will not divide x^k + 1 for any value of k below 32,768.


If there are an odd number of bits in error, E(x) contains an odd number of terms (e.g., x^5 + x^2 + 1, but not
x^2 + 1). Interestingly, no polynomial with an odd number of terms has x + 1 as a factor in the modulo 2 system. By
making x + 1 a factor of G(x), we can catch all errors consisting of an odd number of inverted bits.


To see that no polynomial with an odd number of terms is divisible by x + 1, assume that E(x) has an odd
number of terms and is divisible by x + 1. Factor E(x) into (x + 1)Q(x). Now evaluate E(1) = (1 + 1)Q(1). Since
1 + 1 = 0 (modulo 2), E(1) must be zero. If E(x) has an odd number of terms, substituting 1 for x everywhere will
always yield 1 as the result. Thus, no polynomial with an odd number of terms is divisible by x + 1.
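The argument can be checked numerically. The sketch below, with the hypothetical names eval_at_one and mod2_remainder, assumes polynomials of degree at most 31 held in an integer. It compares E(1), which is just the parity of the number of terms, with the remainder left when E(x) is divided by x + 1 (bit string 11).

```c
#include <stdint.h>

/* E(1) modulo 2: every term contributes 1, so this is the parity of the
   number of 1 bits in e. */
int eval_at_one(uint32_t e)
{
    int sum = 0;
    while (e != 0) {
        sum ^= (int)(e & 1u);
        e >>= 1;
    }
    return sum;
}

/* Remainder of value divided by gen (nonzero), using modulo-2 division. */
uint32_t mod2_remainder(uint32_t value, uint32_t gen)
{
    int r = 0;
    while (r < 31 && (gen >> (r + 1)) != 0)
        r++;                                  /* r = degree of gen */
    for (int bit = 31; bit >= r; bit--)
        if (value & ((uint32_t)1 << bit))
            value ^= gen << (bit - r);
    return value;
}
```

For x^5 + x^2 + 1 (100101) both functions give 1, so x + 1 is not a factor; for x^2 + 1 (101) both give 0, so it is.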




Finally, and most importantly, a polynomial code with r check bits will detect all burst errors of length ≤ r. A
burst error of length k can be represented by x^i(x^{k-1} + ... + 1), where i determines how far from the right-hand
end of the received frame the burst is located. If G(x) contains an x^0 term, it will not have x^i as a factor, so if
the degree of the parenthesized expression is less than the degree of G(x), the remainder can never be zero.


If the burst length is r + 1, the remainder of the division by G(x) will be zero if and only if the burst is identical to
G(x). By definition of a burst, the first and last bits must be 1, so whether it matches depends on the r - 1
intermediate bits. If all combinations are regarded as equally likely, the probability of such an incorrect frame
being accepted as valid is (1/2)^{r-1}.
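For a small generator these burst properties can be verified by brute force. The following sketch assumes frames of at most 32 bits and burst lengths of at least 1; undetected_bursts is a hypothetical helper that tries every burst of a given length at every position and counts those the receiver would accept.

```c
#include <stdint.h>

/* Remainder of value divided by gen (nonzero), using modulo-2 division. */
uint32_t mod2_remainder(uint32_t value, uint32_t gen)
{
    int r = 0;
    while (r < 31 && (gen >> (r + 1)) != 0)
        r++;                                  /* r = degree of gen */
    for (int bit = 31; bit >= r; bit--)
        if (value & ((uint32_t)1 << bit))
            value ^= gen << (bit - r);
    return value;
}

/* Count bursts of exactly len bits (first and last bit inverted, any
   mixture in between) that leave the nbits-bit frame t divisible by gen. */
int undetected_bursts(uint32_t t, int nbits, uint32_t gen, int len)
{
    uint32_t middles = (len >= 2) ? ((uint32_t)1 << (len - 2)) : 1;
    int missed = 0;
    for (int pos = 0; pos + len <= nbits; pos++) {
        for (uint32_t mid = 0; mid < middles; mid++) {
            uint32_t burst = ((uint32_t)1 << (len - 1)) | (mid << 1) | 1u;
            if (mod2_remainder(t ^ (burst << pos), gen) == 0)
                missed++;
        }
    }
    return missed;
}
```

With the 14-bit frame 0x35BE and the generator 10011 (r = 4), no burst of length 1 through 4 slips by. Of the 80 possible bursts of length 5 (10 positions times 8 patterns), exactly 10 are missed: those equal to G(x) itself, a fraction of 1/8 = (1/2)^{r-1}.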


It can also be shown that when an error burst longer than r + 1 bits occurs or when several shorter bursts occur, 
the probability of a bad frame getting through unnoticed is (1/2)^r, assuming that all bit patterns are equally likely.


Certain polynomials have become international standards. The one used in IEEE 802 is

x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x^1 + 1

Among other desirable properties, it has the property that it detects all bursts of length 32 or less and all bursts
affecting an odd number of bits.


Although the calculation required to compute the checksum may seem complicated, Peterson and Brown (1961) 
have shown that a simple shift register circuit can be constructed to compute and verify the checksums in 
hardware. In practice, this hardware is nearly always used. Virtually all LANs use it and point-to-point lines do, 
too, in some cases. 
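The idea of a shift register can be imitated in software. The sketch below is a simple model in that spirit, not a description of any particular circuit: the register holds r bits, each incoming bit is shifted in at the low-order end, and whenever a 1 is shifted out of the high-order end the low-order terms of G(x) are XORed into the register. The names crc_step and crc_shift_register are hypothetical, the frame is assumed to be a string of '0' and '1' characters, and r is assumed to be between 1 and 31.

```c
#include <stdint.h>

/* Clock one input bit into an r-bit register with feedback taps. */
static uint32_t crc_step(uint32_t reg, unsigned bit, uint32_t taps, int r)
{
    uint32_t mask = ((uint32_t)1 << r) - 1;
    uint32_t out = (reg >> (r - 1)) & 1u;     /* bit about to be shifted out */
    reg = ((reg << 1) | bit) & mask;
    if (out)
        reg ^= taps;                          /* feedback: subtract G(x) */
    return reg;
}

/* Feed the frame bits, then r zero bits; the register ends up holding the
   remainder of x^r M(x) divided by G(x). */
uint32_t crc_shift_register(const char *bits, uint32_t gen, int r)
{
    uint32_t taps = gen & (((uint32_t)1 << r) - 1);   /* G(x) without x^r */
    uint32_t reg = 0;
    for (const char *p = bits; *p != '\0'; p++)
        reg = crc_step(reg, *p == '1', taps, r);
    for (int i = 0; i < r; i++)
        reg = crc_step(reg, 0, taps, r);
    return reg;
}
```

Calling crc_shift_register("1101011011", 0x13, 4) reproduces the remainder 1110 of Fig. 3-8, one bit per clock step and without storing the whole frame.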


For decades, it has been assumed that frames to be checksummed contain random bits. All analyses of 
checksum algorithms have been made under this assumption. Inspection of real data has shown this 
assumption to be quite wrong. As a consequence, under some circumstances, undetected errors are much more 
common than had been previously thought (Partridge et al., 1995). 


3.3 Elementary Data Link Protocols 


To introduce the subject of protocols, we will begin by looking at three protocols of increasing complexity. For 
interested readers, a simulator for these and subsequent protocols is available via the Web (see the preface). 
Before we look at the protocols, it is useful to make explicit some of the assumptions underlying the model of 
communication. To start with, we assume that the physical layer, data link layer, and network layer are
independent processes that communicate by passing messages back and forth. In many cases, the physical and 
data link layer processes will be running on a processor inside a special network I/O chip and the network layer 
code will be running on the main CPU. However, other implementations are also possible (e.g., three processes 
inside a single I/O chip; or the physical and data link layers as procedures called by the network layer process). 
In any event, treating the three layers as separate processes makes the discussion conceptually cleaner and 
also serves to emphasize the independence of the layers. 


Another key assumption is that machine A wants to send a long stream of data to machine B, using a reliable, 
connection-oriented service. Later, we will consider the case where B also wants to send data to A
simultaneously. A is assumed to have an infinite supply of data ready to send and never has to wait for data to
be produced. Instead, when A's data link layer asks for data, the network layer is always able to comply 
immediately. (This restriction, too, will be dropped later.) 


We also assume that machines do not crash. That is, these protocols deal with communication errors, but not 
the problems caused by computers crashing and rebooting. 


As far as the data link layer is concerned, the packet passed across the interface to it from the network layer is 
pure data, whose every bit is to be delivered to the destination's network layer. The fact that the destination's 
network layer may interpret part of the packet as a header is of no concern to the data link layer. 


When the data link layer accepts a packet, it encapsulates the packet in a frame by adding a data link header 
and trailer to it (see Fig. 3-1). Thus, a frame consists of an embedded packet, some control information (in the 
header), and a checksum (in the trailer). The frame is then transmitted to the data link layer on the other 
machine. We will assume that there exist suitable library procedures to_physical_layer to send a frame and
from_physical_layer to receive a frame. The transmitting hardware computes and appends the checksum (thus
creating the trailer), so that the data link layer software need not worry about it. The polynomial algorithm
discussed earlier in this chapter might be used, for example. 


Initially, the receiver has nothing to do. It just sits around waiting for something to happen. In the example 
protocols of this chapter we will indicate that the data link layer is waiting for something to happen by the 
procedure call wait_for_event(&event). This procedure only returns when something has happened (e.g., a
frame has arrived). Upon return, the variable event tells what happened. The set of possible events differs for 
the various protocols to be described and will be defined separately for each protocol. Note that in a more 
realistic situation, the data link layer will not sit in a tight loop waiting for an event, as we have suggested, but will 
receive an interrupt, which will cause it to stop whatever it was doing and go handle the incoming frame. 
Nevertheless, for simplicity we will ignore all the details of parallel activity within the data link layer and assume 
that it is dedicated full time to handling just our one channel. 


When a frame arrives at the receiver, the hardware computes the checksum. If the checksum is incorrect (i.e., 
there was a transmission error), the data link layer is so informed (event = cksum_err). If the inbound frame
arrived undamaged, the data link layer is also informed (event = frame_arrival) so that it can acquire the frame
for inspection using from_physical_layer. As soon as the receiving data link layer has acquired an undamaged
frame, it checks the control information in the header, and if everything is all right, passes the packet portion to 
the network layer. Under no circumstances is a frame header ever given to a network layer. 


There is a good reason why the network layer must never be given any part of the frame header: to keep the 
network and data link protocols completely separate. As long as the network layer knows nothing at all about the 
data link protocol or the frame format, these things can be changed without requiring changes to the network 
layer's software. Providing a rigid interface between network layer and data link layer greatly simplifies the 
software design because communication protocols in different layers can evolve independently. 


Figure 3-9 shows some declarations (in C) common to many of the protocols to be discussed later. Five data 
structures are defined there: boolean, seq_nr, packet, frame_kind, and frame. A boolean is an enumerated type
and can take on the values true and false. A seq_nr is a small integer used to number the frames so that we can
tell them apart. These sequence numbers run from 0 up to and including MAX_SEQ, which is defined in each
protocol needing it. A packet is the unit of information exchanged between the network layer and the data link
layer on the same machine, or between network layer peers. In our model it always contains MAX_PKT bytes,
but more realistically it would be of variable length. 

Figure 3-9. Some definitions needed in the protocols to follow. These definitions are located in the file
protocol.h.


#define MAX_PKT 1024                                   /* determines packet size in bytes */

typedef enum {false, true} boolean;                    /* boolean type */
typedef unsigned int seq_nr;                           /* sequence or ack numbers */
typedef struct {unsigned char data[MAX_PKT];} packet;  /* packet definition */
typedef enum {data, ack, nak} frame_kind;              /* frame_kind definition */

typedef struct {                                       /* frames are transported in this layer */
  frame_kind kind;                                     /* what kind of a frame is it? */
  seq_nr seq;                                          /* sequence number */
  seq_nr ack;                                          /* acknowledgement number */
  packet info;                                         /* the network layer packet */
} frame;

/* Wait for an event to happen; return its type in event. */
void wait_for_event(event_type *event);

/* Fetch a packet from the network layer for transmission on the channel. */
void from_network_layer(packet *p);

/* Deliver information from an inbound frame to the network layer. */
void to_network_layer(packet *p);

/* Go get an inbound frame from the physical layer and copy it to r. */
void from_physical_layer(frame *r);

/* Pass the frame to the physical layer for transmission. */
void to_physical_layer(frame *s);

/* Start the clock running and enable the timeout event. */
void start_timer(seq_nr k);

/* Stop the clock and disable the timeout event. */
void stop_timer(seq_nr k);

/* Start an auxiliary timer and enable the ack_timeout event. */
void start_ack_timer(void);

/* Stop the auxiliary timer and disable the ack_timeout event. */
void stop_ack_timer(void);

/* Allow the network layer to cause a network_layer_ready event. */
void enable_network_layer(void);

/* Forbid the network layer from causing a network_layer_ready event. */
void disable_network_layer(void);

/* Macro inc is expanded in-line: Increment k circularly. */
#define inc(k) if (k < MAX_SEQ) k = k + 1; else k = 0

A frame is composed of four fields: kind, seq, ack, and info, the first three of which contain control information 
and the last of which may contain actual data to be transferred. These control fields are collectively called the 
frame header. 


The kind field tells whether there are any data in the frame, because some of the protocols distinguish frames 
containing only control information from those containing data as well. The seq and ack fields are used for 
sequence numbers and acknowledgements, respectively; their use will be described in more detail later. The info 
field of a data frame contains a single packet; the info field of a control frame is not used. A more realistic 
implementation would use a variable-length info field, omitting it altogether for control frames. 


Again, it is important to realize the relationship between a packet and a frame. The network layer builds a packet 
by taking a message from the transport layer and adding the network layer header to it. This packet is passed to 
the data link layer for inclusion in the info field of an outgoing frame. When the frame arrives at the destination, 
the data link layer extracts the packet from the frame and passes the packet to the network layer. In this manner, 
the network layer can act as though machines can exchange packets directly.
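The relationship between packet and frame can be made concrete with a small sketch. The type definitions repeat those of Fig. 3-9 so that the example stands on its own; the functions encapsulate and decapsulate are hypothetical helpers introduced only for illustration and do not appear in the protocols.

```c
#include <string.h>

#define MAX_PKT 1024                                    /* packet size in bytes */
typedef unsigned int seq_nr;                            /* sequence or ack numbers */
typedef struct { unsigned char data[MAX_PKT]; } packet;
typedef enum { data, ack, nak } frame_kind;
typedef struct {
    frame_kind kind;                                    /* header field */
    seq_nr seq;                                         /* header field */
    seq_nr ack;                                         /* header field */
    packet info;                                        /* the embedded packet */
} frame;

/* Sender side: wrap a network layer packet in a data frame. */
frame encapsulate(const packet *p, seq_nr seq, seq_nr ack_nr)
{
    frame f;
    memset(&f, 0, sizeof f);
    f.kind = data;
    f.seq = seq;
    f.ack = ack_nr;
    f.info = *p;             /* the packet is copied bit for bit */
    return f;
}

/* Receiver side: hand back only the packet; the header stays behind. */
packet decapsulate(const frame *f)
{
    return f->info;
}
```

Whatever the sender's network layer put in the packet comes out unchanged at the receiver, and the kind, seq, and ack fields never cross the interface to the network layer.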


A number of procedures are also listed in Fig. 3-9. These are library routines whose details are implementation 
dependent and whose inner workings will not concern us further here. The procedure wait_for_event sits in a
tight loop waiting for something to happen, as mentioned earlier. The procedures to_network_layer and
from_network_layer are used by the data link layer to pass packets to the network layer and accept packets from
the network layer, respectively. Note that from_physical_layer and to_physical_layer pass frames between the
data link layer and physical layer, whereas to_network_layer and from_network_layer pass packets between the
data link layer and network layer. In other words, to_network_layer and from_network_layer deal with the
interface between layers 2 and 3, whereas from_physical_layer and to_physical_layer deal with the interface
between layers 1 and 2.


In most of the protocols, we assume that the channel is unreliable and loses entire frames upon occasion. To be 
able to recover from such calamities, the sending data link layer must start an internal timer or clock whenever it 
sends a frame. If no reply has been received within a certain predetermined time interval, the clock times out and 
the data link layer receives an interrupt signal. 


In our protocols this is handled by allowing the procedure wait_for_event to return event = timeout. The
procedures start_timer and stop_timer turn the timer on and off, respectively. Timeouts are possible only when
the timer is running. It is explicitly permitted to call start_timer while the timer is running; such a call simply resets
the clock to cause the next timeout after a full timer interval has elapsed (unless it is reset or turned off in the 
meanwhile). 
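These timer semantics can be modeled in a few lines of C. The sketch below is a simulation under stated assumptions, not the library implementation: time advances in discrete ticks, the interval is at least one tick, and the names sim_timer, sim_start_timer, sim_stop_timer, and sim_tick are hypothetical.

```c
/* A simulated timer: it can cause a timeout only while it is running. */
typedef struct {
    int running;            /* 1 while the clock is running */
    unsigned remaining;     /* ticks left until the timeout */
    unsigned interval;      /* full timer interval in ticks (at least 1) */
} sim_timer;

/* Start the clock; if it is already running, reset it to a full interval. */
void sim_start_timer(sim_timer *t)
{
    t->running = 1;
    t->remaining = t->interval;
}

/* Stop the clock and disable the timeout. */
void sim_stop_timer(sim_timer *t)
{
    t->running = 0;
}

/* Advance time by one tick; return 1 if a timeout event occurs now. */
int sim_tick(sim_timer *t)
{
    if (!t->running)
        return 0;
    if (--t->remaining == 0) {
        t->running = 0;
        return 1;
    }
    return 0;
}
```

Calling sim_start_timer a second time before the timeout pushes the timeout back by a full interval, which is the resetting behavior described above.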


The procedures start_ack_timer and stop_ack_timer control an auxiliary timer used to generate
acknowledgements under certain conditions. 


The procedures enable_network_layer and disable_network_layer are used in the more sophisticated protocols,
where we no longer assume that the network layer always has packets to send. When the data link layer
enables the network layer, the network layer is then permitted to interrupt when it has a packet to be sent. We
indicate this with event = network_layer_ready. When a network layer is disabled, it may not cause such events.
By being careful about when it enables and disables its network layer, the data link layer can prevent the 
network layer from swamping it with packets for which it has no buffer space. 


Frame sequence numbers are always in the range 0 to MAX_SEQ (inclusive), where MAX_SEQ is different for
the different protocols. It is frequently necessary to advance a sequence number by 1 circularly (i.e., MAX_SEQ
is followed by 0). The macro inc performs this incrementing. It has been defined as a macro because it is used
in-line within the critical path. As we will see later, the factor limiting network performance is often protocol
processing, so defining simple operations like this as macros does not affect the readability of the code but does
improve performance. Also, since MAX_SEQ will have different values in different protocols, by making it a
macro, it becomes possible to include all the protocols in the same binary without conflict. This ability is useful 
for the simulator. 
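The circular increment can also be written as a function, which is easier to test in isolation. The sketch below is an alternative formulation, not the definition used in protocol.h: next_seq is a hypothetical name, and MAX_SEQ is set to 7 purely for illustration.

```c
#define MAX_SEQ 7                 /* assumed value; each protocol defines its own */

typedef unsigned int seq_nr;      /* sequence or ack numbers */

/* Return the sequence number that follows k; MAX_SEQ is followed by 0. */
seq_nr next_seq(seq_nr k)
{
    return (k < MAX_SEQ) ? k + 1 : 0;
}
```

A call such as k = next_seq(k) has the same effect as inc(k). Unlike the macro, the function evaluates its argument exactly once and can be used inside an expression.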


The declarations of Fig. 3-9 are part of each of the protocols to follow. To save space and to provide a 
convenient reference, they have been extracted and listed together, but conceptually they should be merged 
with the protocols themselves. In C, this merging is done by putting the definitions in a special header file, in this 
case protocol.h, and using the #include facility of the C preprocessor to include them in the protocol files. 


3.3.1 An Unrestricted Simplex Protocol 


As an initial example we will consider a protocol that is as simple as it can be. Data are transmitted in one 
direction only. Both the transmitting and receiving network layers are always ready. Processing time can be 
ignored. Infinite buffer space is available. And best of all, the communication channel between the data link 
layers never damages or loses frames. This thoroughly unrealistic protocol, which we will nickname "utopia," is 
shown in Fig. 3-10. 

Figure 3-10. An unrestricted simplex protocol.


/* Protocol 1 (utopia) provides for data transmission in one direction only, from
   sender to receiver. The communication channel is assumed to be error free
   and the receiver is assumed to be able to process all the input infinitely quickly.
   Consequently, the sender just sits in a loop pumping data out onto the line as
   fast as it can. */

typedef enum {frame_arrival} event_type;
#include "protocol.h"

void sender1(void)
{
  frame s;                         /* buffer for an outbound frame */
  packet buffer;                   /* buffer for an outbound packet */

  while (true) {
    from_network_layer(&buffer);   /* go get something to send */
    s.info = buffer;               /* copy it into s for transmission */
    to_physical_layer(&s);         /* send it on its way */
  }                                /* Tomorrow, and tomorrow, and tomorrow,
                                      Creeps in this petty pace from day to day
                                      To the last syllable of recorded time.
                                           - Macbeth, V, v */
}

void receiver1(void)
{
  frame r;
  event_type event;                /* filled in by wait, but not used here */

  while (true) {
    wait_for_event(&event);        /* only possibility is frame_arrival */
    from_physical_layer(&r);       /* go get the inbound frame */
    to_network_layer(&r.info);     /* pass the data to the network layer */
  }
}


The protocol consists of two distinct procedures, a sender and a receiver. The sender runs in the data link layer 
of the source machine, and the receiver runs in the data link layer of the destination machine. No sequence 
numbers or acknowledgements are used here, so MAX_SEQ is not needed. The only event type possible is
frame_arrival (i.e., the arrival of an undamaged frame).


The sender is in an infinite while loop just pumping data out onto the line as fast as it can. The body of the loop 
consists of three actions: go fetch a packet from the (always obliging) network layer, construct an outbound 
frame using the variable s, and send the frame on its way. Only the info field of the frame is used by this 
protocol, because the other fields have to do with error and flow control and there are no errors or flow control 
restrictions here. 


The receiver is equally simple. Initially, it waits for something to happen, the only possibility being the arrival of 
an undamaged frame. Eventually, the frame arrives and the procedure wait_for_event returns, with event set to
frame_arrival (which is ignored anyway). The call to from_physical_layer removes the newly arrived frame from
the hardware buffer and puts it in the variable r, where the receiver code can get at it. Finally, the data portion is 
passed on to the network layer, and the data link layer settles back to wait for the next frame, effectively 
suspending itself until the frame arrives. 


3.3.2 A Simplex Stop-and-Wait Protocol


Now we will drop the most unrealistic restriction used in protocol 1: the ability of the receiving network layer to 
process incoming data infinitely quickly (or equivalently, the presence in the receiving data link layer of an infinite 
amount of buffer space in which to store all incoming frames while they are waiting their respective turns). The 
communication channel is still assumed to be error free however, and the data traffic is still simplex. 


The main problem we have to deal with here is how to prevent the sender from flooding the receiver with data 
faster than the latter is able to process them. In essence, if the receiver requires a time t to execute 
from_physical_layer plus to_network_layer, the sender must transmit at an average rate less than one frame per
time t. Moreover, if we assume that no automatic buffering and queueing are done within the receiver's
hardware, the sender must never transmit a new frame until the old one has been fetched by
from_physical_layer, lest the new one overwrite the old one.


In certain restricted circumstances (e.g., synchronous transmission and a receiving data link layer fully dedicated 
to processing the one input line), it might be possible for the sender to simply insert a delay into protocol 1 to 
slow it down sufficiently to keep from swamping the receiver. However, more usually, each data link layer will 
have several lines to attend to, and the time interval between a frame arriving and its being processed may vary 
considerably. If the network designers can calculate the worst-case behavior of the receiver, they can program 
the sender to transmit so slowly that even if every frame suffers the maximum delay, there will be no overruns. 
The trouble with this approach is that it is too conservative. It leads to a bandwidth utilization that is far below the 
optimum, unless the best and worst cases are almost the same (i.e., the variation in the data link layer's reaction 
time is small). 


A more general solution to this dilemma is to have the receiver provide feedback to the sender. After having 
passed a packet to its network layer, the receiver sends a little dummy frame back to the sender which, in effect, 
gives the sender permission to transmit the next frame. After having sent a frame, the sender is required by the 
protocol to bide its time until the little dummy (i.e., acknowledgement) frame arrives. Using feedback from the 
receiver to let the sender know when it may send more data is an example of the flow control mentioned earlier. 


Protocols in which the sender sends one frame and then waits for an acknowledgement before proceeding are 
called stop-and-wait. Figure 3-11 gives an example of a simplex stop-and-wait protocol. 

Figure 3-11. A simplex stop-and-wait protocol.


/* Protocol 2 (stop-and-wait) also provides for a one-directional flow of data from
   sender to receiver. The communication channel is once again assumed to be error
   free, as in protocol 1. However, this time, the receiver has only a finite buffer
   capacity and a finite processing speed, so the protocol must explicitly prevent
   the sender from flooding the receiver with data faster than it can be handled. */

typedef enum {frame_arrival} event_type;
#include "protocol.h"

void sender2(void)
{
  frame s;                         /* buffer for an outbound frame */
  packet buffer;                   /* buffer for an outbound packet */
  event_type event;                /* frame_arrival is the only possibility */

  while (true) {
    from_network_layer(&buffer);   /* go get something to send */
    s.info = buffer;               /* copy it into s for transmission */
    to_physical_layer(&s);         /* bye-bye little frame */
    wait_for_event(&event);        /* do not proceed until given the go ahead */
  }
}

void receiver2(void)
{
  frame r, s;                      /* buffers for frames */
  event_type event;                /* frame_arrival is the only possibility */

  while (true) {
    wait_for_event(&event);        /* only possibility is frame_arrival */
    from_physical_layer(&r);       /* go get the inbound frame */
    to_network_layer(&r.info);     /* pass the data to the network layer */
    to_physical_layer(&s);         /* send a dummy frame to awaken sender */
  }
}



Although data traffic in this example is simplex, going only from the sender to the receiver, frames do travel in 
both directions. Consequently, the communication channel between the two data link layers needs to be capable 
of bidirectional information transfer. However, this protocol entails a strict alternation of flow: first the sender 
sends a frame, then the receiver sends a frame, then the sender sends another frame, then the receiver sends 
another one, and so on. A half-duplex physical channel would suffice here.


As in protocol 1, the sender starts out by fetching a packet from the network layer, using it to construct a frame, 
and sending it on its way. But now, unlike in protocol 1, the sender must wait until an acknowledgement frame 
arrives before looping back and fetching the next packet from the network layer. The sending data link layer 
need not even inspect the incoming frame: there is only one possibility. The incoming frame is always an 
acknowledgement. 


The only difference between receiver1 and receiver2 is that after delivering a packet to the network layer, 
receiver2 sends an acknowledgement frame back to the sender before entering the wait loop again. Because 
only the arrival of the frame back at the sender is important, not its contents, the receiver need not put any 
particular information in it. 


3.3.3 A Simplex Protocol for a Noisy Channel 


Now let us consider the normal situation of a communication channel that makes errors. Frames may be either 
damaged or lost completely. However, we assume that if a frame is damaged in transit, the receiver hardware 
will detect this when it computes the checksum. If the frame is damaged in such a way that the checksum is 
nevertheless correct, an unlikely occurrence, this protocol (and all other protocols) can fail (i.e., deliver an 
incorrect packet to the network layer). 


At first glance it might seem that a variation of protocol 2 would work: adding a timer. The sender could send a 
frame, but the receiver would only send an acknowledgement frame if the data were correctly received. If a 
damaged frame arrived at the receiver, it would be discarded. After a while the sender would time out and send 
the frame again. This process would be repeated until the frame finally arrived intact.


The above scheme has a fatal flaw in it. Think about the problem and try to discover what might go wrong before 
reading further. 


To see what might go wrong, remember that it is the task of the data link layer processes to provide error-free, 
transparent communication between network layer processes. The network layer on machine A gives a series of 
packets to its data link layer, which must ensure that an identical series of packets are delivered to the network 
layer on machine B by its data link layer. In particular, the network layer on B has no way of knowing that a 
packet has been lost or duplicated, so the data link layer must guarantee that no combination of transmission 
errors, however unlikely, can cause a duplicate packet to be delivered to a network layer. 


Consider the following scenario: 


1. The network layer on A gives packet 1 to its data link layer. The packet is correctly received at B and 
passed to the network layer on B. B sends an acknowledgement frame back to A. 

2. The acknowledgement frame gets lost completely. It just never arrives at all. Life would be a great deal 
simpler if the channel mangled and lost only data frames and not control frames, but sad to say, the 
channel is not very discriminating. 

3. The data link layer on A eventually times out. Not having received an acknowledgement, it (incorrectly) 
assumes that its data frame was lost or damaged and sends the frame containing packet 1 again. 

4. The duplicate frame also arrives at the data link layer on B perfectly and is unwittingly passed to the 
network layer there. If A is sending a file to B, part of the file will be duplicated (i.e., the copy of the file 
made by B will be incorrect and the error will not have been detected). In other words, the protocol will 
fail. 
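The failure scenario can be replayed with a tiny simulation. It is only a sketch under simplifying assumptions: data frames always arrive intact, only acknowledgements can be lost, and a lost acknowledgement always leads to a timeout and a retransmission. The function run_flawed_protocol and its parameters are hypothetical.

```c
/* Simulate the flawed scheme: timer and retransmission, but no sequence
   numbers. ack_lost[i] is 1 if the i-th acknowledgement sent by B is lost
   (acknowledgements beyond nacks are assumed to arrive). Each packet handed
   to B's network layer is recorded in delivered[]; the total is returned. */
int run_flawed_protocol(int npackets, const int *ack_lost, int nacks,
                        int *delivered, int capacity)
{
    int count = 0;                   /* packets passed to B's network layer */
    int acks = 0;                    /* acknowledgements sent so far */

    for (int pkt = 1; pkt <= npackets; pkt++) {
        for (;;) {
            /* The frame arrives at B, which cannot tell whether it is new. */
            if (count < capacity)
                delivered[count] = pkt;
            count++;

            /* B sends an acknowledgement, which may get lost. */
            int lost = (acks < nacks) ? ack_lost[acks] : 0;
            acks++;
            if (!lost)
                break;               /* A moves on to the next packet */
            /* Otherwise A times out and sends the same packet again. */
        }
    }
    return count;
}
```

With two packets and the first acknowledgement lost, B's network layer receives three packets (1, 1, 2): packet 1 has been duplicated, exactly as in the scenario above.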


Clearly, what is needed is some way for the receiver to be able to distinguish a frame that it is seeing for the first 
time from a retransmission. The obvious way to achieve this is to have the sender put a sequence number in the 
header of each frame it sends. Then the receiver can check the sequence number of each arriving frame to see 
if it is a new frame or a duplicate to be discarded. 


Since a small frame header is desirable, the question arises: What is the minimum number of bits needed for the 
sequence number? The only ambiguity in this protocol is between a frame, m, and its direct successor, m + 1. If 
frame m is lost or damaged, the receiver will not acknowledge it, so the sender will keep trying to send it. Once it 
has been correctly received, the receiver will send an acknowledgement to the sender. It is here that the 
potential trouble crops up. Depending upon whether the acknowledgement frame gets back to the sender 
correctly or not, the sender may try to send m or m + 1. 


The event that triggers the sender to start sending frame m + 2 is the arrival of an acknowledgement for frame m 
+ 1. But this implies that m has been correctly received, and furthermore that its acknowledgement has also 
been correctly received by the sender (otherwise, the sender would not have begun with m + 1, let alone m + 2). 
As a consequence, the only ambiguity is between a frame and its immediate predecessor or successor, not 
between the predecessor and successor themselves. 


A 1-bit sequence number (0 or 1) is therefore sufficient. At each instant of time, the receiver expects a particular 
sequence number next. Any arriving frame containing the wrong sequence number is rejected as a duplicate. 
When a frame containing the correct sequence number arrives, it is accepted and passed to the network layer. 
Then the expected sequence number is incremented modulo 2 (i.e., 0 becomes 1 and 1 becomes 0). 
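The receiver's duplicate test can be sketched in a few lines of C. This is a minimal illustration, not the book's listing: the names receiver_state and receiver_accept are hypothetical, and a counter stands in for the hand-off to the network layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Receiver state for a 1-bit sequence number (hypothetical names). */
typedef struct {
    unsigned frame_expected;   /* 0 or 1: sequence number expected next */
    unsigned delivered;        /* packets handed to the network layer so far */
} receiver_state;

/* Process one undamaged frame; return true if it was new and delivered. */
bool receiver_accept(receiver_state *rx, unsigned seq)
{
    assert(seq <= 1);                              /* the field is one bit wide */
    if (seq != rx->frame_expected)
        return false;                              /* duplicate: discard it */
    rx->delivered++;                               /* stand-in for passing the packet up */
    rx->frame_expected = 1 - rx->frame_expected;   /* increment modulo 2 */
    return true;
}
```

Replaying the failure scenario above (frame 0, then a retransmitted frame 0, then frame 1) delivers exactly two packets: the retransmission is rejected because the receiver is already expecting sequence number 1.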


An example of this kind of protocol is shown in Fig. 3-12. Protocols in which the sender waits for a positive 
acknowledgement before advancing to the next data item are often called PAR (Positive Acknowledgement with 
Retransmission) or ARQ (Automatic Repeat reQuest). Like protocol 2, this one also transmits data only in one 
direction. 

Figure 3-12. A positive acknowledgement with retransmission protocol. 


/* Protocol 3 (par) allows unidirectional data flow over an unreliable channel. */ 

#define MAX_SEQ 1                         /* must be 1 for protocol 3 */ 
typedef enum {frame_arrival, cksum_err, timeout} event_type; 
#include "protocol.h" 

void sender3(void) 
{ 
  seq_nr next_frame_to_send;              /* seq number of next outgoing frame */ 
  frame s;                                /* scratch variable */ 
  packet buffer;                          /* buffer for an outbound packet */ 
  event_type event; 

  next_frame_to_send = 0;                 /* initialize outbound sequence numbers */ 
  from_network_layer(&buffer);            /* fetch first packet */ 
  while (true) { 
    s.info = buffer;                      /* construct a frame for transmission */ 
    s.seq = next_frame_to_send;           /* insert sequence number in frame */ 
    to_physical_layer(&s);                /* send it on its way */ 
    start_timer(s.seq);                   /* if answer takes too long, time out */ 
    wait_for_event(&event);               /* frame_arrival, cksum_err, timeout */ 
    if (event == frame_arrival) { 
      from_physical_layer(&s);            /* get the acknowledgement */ 
      if (s.ack == next_frame_to_send) { 
        stop_timer(s.ack);                /* turn the timer off */ 
        from_network_layer(&buffer);      /* get the next one to send */ 
        inc(next_frame_to_send);          /* invert next_frame_to_send */ 
      } 
    } 
  } 
} 

void receiver3(void) 
{ 
  seq_nr frame_expected; 
  frame r, s; 
  event_type event; 

  frame_expected = 0; 
  while (true) { 
    wait_for_event(&event);               /* possibilities: frame_arrival, cksum_err */ 
    if (event == frame_arrival) {         /* a valid frame has arrived. */ 
      from_physical_layer(&r);            /* go get the newly arrived frame */ 
      if (r.seq == frame_expected) {      /* this is what we have been waiting for. */ 
        to_network_layer(&r.info);        /* pass the data to the network layer */ 
        inc(frame_expected);              /* next time expect the other sequence nr */ 
      } 
      s.ack = 1 - frame_expected;         /* tell which frame is being acked */ 
      to_physical_layer(&s);              /* send acknowledgement */ 
    } 
  } 
} 


Protocol 3 differs from its predecessors in that both sender and receiver have a variable whose value is 
remembered while the data link layer is in the wait state. The sender remembers the sequence number of the 
next frame to send in next_frame_to_send; the receiver remembers the sequence number of the next frame 
expected in frame_expected. Each protocol has a short initialization phase before entering the infinite loop. 


After transmitting a frame, the sender starts the timer running. If it was already running, it will be reset to allow 
another full timer interval. The time interval should be chosen to allow enough time for the frame to get to the 
receiver, for the receiver to process it in the worst case, and for the acknowledgement frame to propagate back 
to the sender. Only when that time interval has elapsed is it safe to assume that either the transmitted frame or 
its acknowledgement has been lost, and to send a duplicate. If the timeout interval is set too short, the sender 
will transmit unnecessary frames. While these extra frames will not affect the correctness of the protocol, they 
will hurt performance. 


After transmitting a frame and starting the timer, the sender waits for something exciting to happen. Only three 
possibilities exist: an acknowledgement frame arrives undamaged, a damaged acknowledgement frame 
staggers in, or the timer expires. If a valid acknowledgement comes in, the sender fetches the next packet from 
its network layer and puts it in the buffer, overwriting the previous packet. It also advances the sequence 
number. If a damaged frame arrives or no frame at all arrives, neither the buffer nor the sequence number is 
changed so that a duplicate can be sent. 
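The sender's reaction to the three possible events can be condensed into one decision function. The sketch below is illustrative only; sender_state, sender_event, and sender_on_event are hypothetical names, and a counter replaces the real call that fetches a packet from the network layer.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { EV_FRAME_ARRIVAL, EV_CKSUM_ERR, EV_TIMEOUT } sender_event;

typedef struct {
    unsigned next_frame_to_send;   /* 0 or 1 */
    unsigned packets_fetched;      /* packets taken from the network layer so far */
} sender_state;

/* Returns true if the sender moves on to a new packet, false if it must
   retransmit the frame it already holds. The ack value is only meaningful
   when an undamaged frame has arrived. */
bool sender_on_event(sender_state *tx, sender_event ev, unsigned ack)
{
    assert(ack <= 1);
    if (ev == EV_FRAME_ARRIVAL && ack == tx->next_frame_to_send) {
        tx->packets_fetched++;                                /* overwrite the buffer */
        tx->next_frame_to_send = 1 - tx->next_frame_to_send;  /* advance modulo 2 */
        return true;
    }
    return false;   /* damaged frame, stale ack, or timeout: state is unchanged */
}
```

Because a timeout or a damaged acknowledgement leaves both the buffer and the sequence number untouched, the next pass through the sending loop automatically produces an exact duplicate.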


When a valid frame arrives at the receiver, its sequence number is checked to see if it is a duplicate. If not, it is 
accepted, passed to the network layer, and an acknowledgement is generated. Duplicates and damaged frames 
are not passed to the network layer. 


3.4 Sliding Window Protocols 


In the previous protocols, data frames were transmitted in one direction only. In most practical situations, there is 
a need for transmitting data in both directions. One way of achieving full-duplex data transmission is to have two 
separate communication channels and use each one for simplex data traffic (in different directions). If this is 
done, we have two separate physical circuits, each with a "forward" channel (for data) and a "reverse" channel 
(for acknowledgements). In both cases the bandwidth of the reverse channel is almost entirely wasted. In effect, 
the user is paying for two circuits but using only the capacity of one. 


A better idea is to use the same circuit for data in both directions. After all, in protocols 2 and 3 it was already 
being used to transmit frames both ways, and the reverse channel has the same capacity as the forward 
channel. In this model the data frames from A to B are intermixed with the acknowledgement frames from A to B. 
By looking at the kind field in the header of an incoming frame, the receiver can tell whether the frame is data or 
acknowledgement. 


Although interleaving data and control frames on the same circuit is an improvement over having two separate 
physical circuits, yet another improvement is possible. When a data frame arrives, instead of immediately 
sending a separate control frame, the receiver restrains itself and waits until the network layer passes it the next 
packet. The acknowledgement is attached to the outgoing data frame (using the ack field in the frame header). 
In effect, the acknowledgement gets a free ride on the next outgoing data frame. The technique of temporarily 
delaying outgoing acknowledgements so that they can be hooked onto the next outgoing data frame is known as 
piggybacking. 


The principal advantage of using piggybacking over having distinct acknowledgement frames is a better use of 
the available channel bandwidth. The ack field in the frame header costs only a few bits, whereas a separate 
frame would need a header, the acknowledgement, and a checksum. In addition, fewer frames sent means 
fewer "frame arrival" interrupts, and perhaps fewer buffers in the receiver, depending on how the receiver's 
software is organized. In the next protocol to be examined, the piggyback field costs only 1 bit in the frame 
header. It rarely costs more than a few bits. 


However, piggybacking introduces a complication not present with separate acknowledgements. How long 
should the data link layer wait for a packet onto which to piggyback the acknowledgement? If the data link layer 
waits longer than the sender's timeout period, the frame will be retransmitted, defeating the whole purpose of 
having acknowledgements. If the data link layer were an oracle and could foretell the future, it would know when 
the next network layer packet was going to come in and could decide either to wait for it or send a separate 
acknowledgement immediately, depending on how long the projected wait was going to be. Of course, the data 
link layer cannot foretell the future, so it must resort to some ad hoc scheme, such as waiting a fixed number of 
milliseconds. If a new packet arrives quickly, the acknowledgement is piggybacked onto it; otherwise, if no new 
packet has arrived by the end of this time period, the data link layer just sends a separate acknowledgement 
frame. 
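The fixed-wait scheme just described amounts to a three-way choice each time the data link layer re-examines a pending acknowledgement. The following sketch assumes a hypothetical function choose_ack_action and a caller-supplied limit max_wait_ms, which must be chosen well below the peer's retransmission timeout.

```c
#include <stdbool.h>

typedef enum { ACK_WAIT, ACK_PIGGYBACK, ACK_SEPARATE } ack_action;

/* Decide what to do with a pending acknowledgement.
   packet_ready: the network layer has handed over an outbound packet.
   waited_ms:    how long the acknowledgement has been held back so far.
   max_wait_ms:  the fixed waiting limit (an assumed tuning parameter). */
ack_action choose_ack_action(bool packet_ready, unsigned waited_ms,
                             unsigned max_wait_ms)
{
    if (packet_ready)
        return ACK_PIGGYBACK;    /* free ride on the outgoing data frame */
    if (waited_ms >= max_wait_ms)
        return ACK_SEPARATE;     /* give up waiting; send a bare ack frame */
    return ACK_WAIT;             /* keep holding the acknowledgement */
}
```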


The next three protocols are bidirectional protocols that belong to a class called sliding window protocols. The 
three differ among themselves in terms of efficiency, complexity, and buffer requirements, as discussed later. In 
these, as in all sliding window protocols, each outbound frame contains a sequence number, ranging from 0 up 
to some maximum. The maximum is usually 2^n - 1 so the sequence number fits exactly in an n-bit field. The 
stop-and-wait sliding window protocol uses n = 1, restricting the sequence numbers to 0 and 1, but more 
sophisticated versions can use arbitrary n. 


The essence of all sliding window protocols is that at any instant of time, the sender maintains a set of sequence 
numbers corresponding to frames it is permitted to send. These frames are said to fall within the sending 
window. Similarly, the receiver also maintains a receiving window corresponding to the set of frames it is 
permitted to accept. The sender's window and the receiver's window need not have the same lower and upper 
limits or even have the same size. In some protocols they are fixed in size, but in others they can grow or shrink 
over the course of time as frames are sent and received. 


Although these protocols give the data link layer more freedom about the order in which it may send and receive 
frames, we have definitely not dropped the requirement that the protocol must deliver packets to the destination 
network layer in the same order they were passed to the data link layer on the sending machine. Nor have we 
changed the requirement that the physical communication channel is "wire-like," that is, it must deliver all frames 
in the order sent. 


The sequence numbers within the sender's window represent frames that have been sent or can be sent but are 
as yet not acknowledged. Whenever a new packet arrives from the network layer, it is given the next highest 
sequence number, and the upper edge of the window is advanced by one. When an acknowledgement comes 
in, the lower edge is advanced by one. In this way the window continuously maintains a list of unacknowledged 
frames. Figure 3-13 shows an example. 


Figure 3-13. A sliding window of size 1, with a 3-bit sequence number. (a) Initially. (b) After the first frame 
has been sent. (c) After the first frame has been received. (d) After the first acknowledgement has been 
received. 


[Figure 3-13 artwork: sender windows (top row) and receiver windows (bottom row), each drawn on a circle of 
sequence numbers 0 through 7, for panels (a) through (d).] 


Since frames currently within the sender's window may ultimately be lost or damaged in transit, the sender must 
keep all these frames in its memory for possible retransmission. Thus, if the maximum window size is n, the 
sender needs n buffers to hold the unacknowledged frames. If the window ever grows to its maximum size, the 
sending data link layer must forcibly shut off the network layer until another buffer becomes free. 
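The bookkeeping for the sender's window can be sketched as follows. The structure and function names are hypothetical, and the 3-bit sequence space is assumed only to match the figure; acknowledgements are taken one frame at a time, as in the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define SEQ_SPACE 8   /* 3-bit sequence numbers 0..7 (assumed for this sketch) */

typedef struct {
    unsigned lower;        /* oldest unacknowledged sequence number */
    unsigned upper;        /* sequence number the next new frame will get */
    unsigned outstanding;  /* frames sent but not yet acknowledged */
    unsigned max_window;   /* maximum window size, i.e. number of buffers */
} send_window;

/* False means the network layer must be shut off until a buffer frees up. */
bool window_can_send(const send_window *w)
{
    return w->outstanding < w->max_window;
}

/* A new packet arrives from the network layer: assign it the next sequence
   number and advance the upper edge of the window. */
unsigned window_send(send_window *w)
{
    unsigned seq;

    assert(window_can_send(w));
    seq = w->upper;
    w->upper = (w->upper + 1) % SEQ_SPACE;
    w->outstanding++;
    return seq;
}

/* An acknowledgement for the oldest frame arrives: advance the lower edge. */
void window_ack(send_window *w)
{
    assert(w->outstanding > 0);
    w->lower = (w->lower + 1) % SEQ_SPACE;
    w->outstanding--;
}
```

With max_window set to 1 the structure passes through the same states as the sender in the figure: both edges equal, then one frame outstanding, then both edges equal again after the acknowledgement.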


The receiving data link layer's window corresponds to the frames it may accept. Any frame falling outside the 
window is discarded without comment. When a frame whose sequence number is equal to the lower edge of the 
window is received, it is passed to the network layer, an acknowledgement is generated, and the window is 
rotated by one. Unlike the sender's window, the receiver's window always remains at its initial size. Note that a 
window size of 1 means that the data link layer only accepts frames in order, but for larger windows this is not 
so. The network layer, in contrast, is always fed data in the proper order, regardless of the data link layer's 
window size. 


Figure 3-13 shows an example with a maximum window size of 1. Initially, no frames are outstanding, so the 
lower and upper edges of the sender's window are equal, but as time goes on, the situation progresses as 
shown. 


3.4.1 A One-Bit Sliding Window Protocol 


Before tackling the general case, let us first examine a sliding window protocol with a maximum window size of 
1. Such a protocol uses stop-and-wait since the sender transmits a frame and waits for its acknowledgement 
before sending the next one. 


Figure 3-14 depicts such a protocol. Like the others, it starts out by defining some variables. 
Next_frame_to_send tells which frame the sender is trying to send. Similarly, frame_expected tells which frame 
the receiver is expecting. In both cases, 0 and 1 are the only possibilities. 


Figure 3-14. A 1-bit sliding window protocol. 


/* Protocol 4 (sliding window) is bidirectional. */ 

#define MAX_SEQ 1                         /* must be 1 for protocol 4 */ 
typedef enum {frame_arrival, cksum_err, timeout} event_type; 
#include "protocol.h" 

void protocol4 (void) 
{ 
  seq_nr next_frame_to_send;              /* 0 or 1 only */ 
  seq_nr frame_expected;                  /* 0 or 1 only */ 
  frame r, s;                             /* scratch variables */ 
  packet buffer;                          /* current packet being sent */ 
  event_type event; 

  next_frame_to_send = 0;                 /* next frame on the outbound stream */ 
  frame_expected = 0;                     /* frame expected next */ 
  from_network_layer(&buffer);            /* fetch a packet from the network layer */ 
  s.info = buffer;                        /* prepare to send the initial frame */ 
  s.seq = next_frame_to_send;             /* insert sequence number into frame */ 
  s.ack = 1 - frame_expected;             /* piggybacked ack */ 
  to_physical_layer(&s);                  /* transmit the frame */ 
  start_timer(s.seq);                     /* start the timer running */ 

  while (true) { 
    wait_for_event(&event);               /* frame_arrival, cksum_err, or timeout */ 
    if (event == frame_arrival) {         /* a frame has arrived undamaged. */ 
      from_physical_layer(&r);            /* go get it */ 

      if (r.seq == frame_expected) {      /* handle inbound frame stream. */ 
        to_network_layer(&r.info);        /* pass packet to network layer */ 
        inc(frame_expected);              /* invert seq number expected next */ 
      } 

      if (r.ack == next_frame_to_send) {  /* handle outbound frame stream. */ 
        stop_timer(r.ack);                /* turn the timer off */ 
        from_network_layer(&buffer);      /* fetch new pkt from network layer */ 
        inc(next_frame_to_send);          /* invert sender's sequence number */ 
      } 
    } 

    s.info = buffer;                      /* construct outbound frame */ 
    s.seq = next_frame_to_send;           /* insert sequence number into it */ 
    s.ack = 1 - frame_expected;           /* seq number of last received frame */ 
    to_physical_layer(&s);                /* transmit a frame */ 
    start_timer(s.seq);                   /* start the timer running */ 
  } 
} 


Under normal circumstances, one of the two data link layers goes first and transmits the first frame. In other 
words, only one of the data link layer programs should contain the to_physical_layer and start_timer procedure 
calls outside the main loop. In the event that both data link layers start off simultaneously, a peculiar situation 
arises, as discussed later. The starting machine fetches the first packet from its network layer, builds a frame 
from it, and sends it. When this (or any) frame arrives, the receiving data link layer checks to see if it is a 
duplicate, just as in protocol 3. If the frame is the one expected, it is passed to the network layer and the 
receiver's window is slid up. 


The acknowledgement field contains the number of the last frame received without error. If this number agrees 
with the sequence number of the frame the sender is trying to send, the sender knows it is done with the frame 
stored in buffer and can fetch the next packet from its network layer. If the sequence number disagrees, it must 
continue trying to send the same frame. Whenever a frame is received, a frame is also sent back. 


Now let us examine protocol 4 to see how resilient it is to pathological scenarios. Assume that computer A is 
trying to send its frame 0 to computer B and that B is trying to send its frame 0 to A. Suppose that A sends a 
frame to B, but A's timeout interval is a little too short. Consequently, A may time out repeatedly, sending a 
series of identical frames, all with seq = 0 and ack = 1. 


When the first valid frame arrives at computer B, it will be accepted and frame_expected will be set to 1. All the 
subsequent frames will be rejected because B is now expecting frames with sequence number 1, not 0. 
Furthermore, since all the duplicates have ack = 1 and B is still waiting for an acknowledgement of 0, B will not 
fetch a new packet from its network layer. 


After every rejected duplicate comes in, B sends A a frame containing seq = 0 and ack = 0. Eventually, one of 
these arrives correctly at A, causing A to begin sending the next packet. No combination of lost frames or 
premature timeouts can cause the protocol to deliver duplicate packets to either network layer, to skip a packet, 
or to deadlock. 


However, a peculiar situation arises if both sides simultaneously send an initial packet. This synchronization 
difficulty is illustrated by Fig. 3-15. In part (a), the normal operation of the protocol is shown. In (b) the peculiarity 
is illustrated. If B waits for A's first frame before sending one of its own, the sequence is as shown in (a), and 
every frame is accepted. However, if A and B simultaneously initiate communication, their first frames cross, and 
the data link layers then get into situation (b). In (a) each frame arrival brings a new packet for the network layer; 
there are no duplicates. In (b) half of the frames contain duplicates, even though there are no transmission 
errors. Similar situations can occur as a result of premature timeouts, even when one side clearly starts first. In 
fact, if multiple premature timeouts occur, frames may be sent three or more times. 


Figure 3-15. Two scenarios for protocol 4. (a) Normal case. (b) Abnormal case. The notation is (seq, ack, 
packet number). An asterisk indicates where a network layer accepts a packet. 


3.4.2 A Protocol Using Go Back N 


Until now we have made the tacit assumption that the transmission time required for a frame to arrive at the 
receiver plus the transmission time for the acknowledgement to come back is negligible. Sometimes this 
assumption is clearly false. In these situations the long round-trip time can have important implications for the 
efficiency of the bandwidth utilization. As an example, consider a 50-kbps satellite channel with a 500-msec 
round-trip propagation delay. Let us imagine trying to use protocol 4 to send 1000-bit frames via the satellite. At 
t = 0 the sender starts sending the first frame. At t = 20 msec the frame has been completely sent. Not until 
t = 270 msec has the frame fully arrived at the receiver, and not until t = 520 msec has the acknowledgement 
arrived back at the sender, under the best of circumstances (no waiting in the receiver and a short 
acknowledgement frame). This means that the sender was blocked during 500/520 or 96 percent of the time. In 
other words, only 4 percent of the available bandwidth was used. Clearly, the combination of a long transit time, 
high bandwidth, and short frame length is disastrous in terms of efficiency. 


The problem described above can be viewed as a consequence of the rule requiring a sender to wait for an 
acknowledgement before sending another frame. If we relax that restriction, much better efficiency can be 
achieved. Basically, the solution lies in allowing the sender to transmit up to w frames before blocking, instead of 
just 1. With an appropriate choice of w the sender will be able to continuously transmit frames for a time equal to 
the round-trip transit time without filling up the window. In the example above, w should be at least 26. The 
sender begins sending frame 0 as before. By the time it has finished sending 26 frames, at t = 520, the 
acknowledgement for frame 0 will have just arrived. Thereafter, acknowledgements arrive every 20 msec, so the 
sender always gets permission to continue just when it needs it. At all times, 25 or 26 unacknowledged frames 
are outstanding. Put in other terms, the sender's maximum window size is 26. 
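The figure of 26 follows from dividing one full cycle (the time to send a frame plus the round-trip delay) by the time to send a frame. A small sketch of that arithmetic, using a hypothetical helper min_window with times in milliseconds:

```c
#include <assert.h>

/* Smallest window, in frames, that keeps the sender busy for one full cycle:
   the time to send one frame plus the round-trip delay. Times are in msec. */
unsigned min_window(unsigned frame_time_ms, unsigned round_trip_ms)
{
    unsigned cycle_ms;

    assert(frame_time_ms > 0);
    cycle_ms = frame_time_ms + round_trip_ms;
    return (cycle_ms + frame_time_ms - 1) / frame_time_ms;   /* ceiling division */
}
```

For the satellite example, a 1000-bit frame at 50 kbps takes 20 msec to send, so min_window(20, 500) gives 520/20 = 26 frames.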


The need for a large window on the sending side occurs whenever the product of bandwidth x round-trip-delay is 
large. If the bandwidth is high, even for a moderate delay, the sender will exhaust its window quickly unless it 
has a large window. If the delay is high (e.g., on a geostationary satellite channel), the sender will exhaust its 
window even for a moderate bandwidth. The product of these two factors basically tells what the capacity of the 
pipe is, and the sender needs the ability to fill it without stopping in order to operate at peak efficiency. 


This technique is known as pipelining. If the channel capacity is b bits/sec, the frame size l bits, and the 
round-trip propagation time R sec, the time required to transmit a single frame is l/b sec. After the last bit of a data 
frame has been sent, there is a delay of R/2 before that bit arrives at the receiver and another delay of at least 
R/2 for the acknowledgement to come back, for a total delay of R. In stop-and-wait the line is busy for l/b and idle 
for R, giving 


line utilization = l/(l + bR) 


If l < bR, the efficiency will be less than 50 percent. Since there is always a nonzero delay for the 
acknowledgement to propagate back, pipelining can, in principle, be used to keep the line busy during this 
interval, but if the interval is small, the additional complexity is not worth the trouble. 
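The utilization formula is easy to check numerically. The helper below is a hypothetical one-line sketch; its parameters follow the symbols used above (l in bits, b in bits/sec, R in sec).

```c
#include <assert.h>

/* Stop-and-wait line utilization: busy for l/b, idle for R, so the fraction
   of time spent sending is (l/b) / (l/b + R) = l / (l + bR). */
double line_utilization(double l_bits, double b_bps, double r_sec)
{
    assert(l_bits > 0.0 && b_bps > 0.0 && r_sec >= 0.0);
    return l_bits / (l_bits + b_bps * r_sec);
}
```

With the satellite numbers used earlier (l = 1000, b = 50,000, R = 0.5) the result is 1000/26,000, or roughly 3.8 percent, which the text rounds to 4 percent. When l equals bR the utilization is exactly one half, which is where the 50 percent threshold comes from.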


Pipelining frames over an unreliable communication channel raises some serious issues. First, what happens if a 
frame in the middle of a long stream is damaged or lost? Large numbers of succeeding frames will arrive at the 
receiver before the sender even finds out that anything is wrong. When a damaged frame arrives at the receiver, 
it obviously should be discarded, but what should the receiver do with all the correct frames following it? 
Remember that the receiving data link layer is obligated to hand packets to the network layer in sequence. In 
Fig. 3-16 we see the effects of pipelining on error recovery. We will now examine it in some detail. 


Figure 3-16. Pipelining and error recovery. Effect of an error when (a) receiver's window size is 1 and (b) 
receiver's window size is large. 


Two basic approaches are available for dealing with errors in the presence of pipelining. One way, called go 
back n, is for the receiver simply to discard all subsequent frames, sending no acknowledgements for the 
discarded frames. This strategy corresponds to a receive window of size 1. In other words, the data link layer 
refuses to accept any frame except the next one it must give to the network layer. If the sender's window fills up 
before the timer runs out, the pipeline will begin to empty. Eventually, the sender will time out and retransmit all 
unacknowledged frames in order, starting with the damaged or lost one. This approach can waste a lot of 
bandwidth if the error rate is high. 


In Fig. 3-16(a) we see go back n for the case in which the receiver's window size is 1. Frames 0 and 1 are 
correctly received and acknowledged. Frame 2, however, is damaged or lost. The sender, unaware of this 
problem, continues to send frames until the timer for frame 2 expires. Then it backs up to frame 2 and starts all 
over with it, sending 2, 3, 4, etc. all over again. 


The other general strategy for handling errors when frames are pipelined is called selective repeat. When it is 
used, a bad frame that is received is discarded, but good frames received after it are buffered. When the sender 
times out, only the oldest unacknowledged frame is retransmitted. If that frame arrives correctly, the receiver can 
deliver to the network layer, in sequence, all the frames it has buffered. Selective repeat is often combined with 
having the receiver send a negative acknowledgement (NAK) when it detects an error, for example, when it 
receives a checksum error or a frame out of sequence. NAKs stimulate retransmission before the corresponding 
timer expires and thus improve performance. 


In Fig. 3-16(b), frames 0 and 1 are again correctly received and acknowledged and frame 2 is lost. When frame 
3 arrives at the receiver, the data link layer there notices that it has missed a frame, so it sends back a NAK for 
2 but buffers 3. When frames 4 and 5 arrive, they, too, are buffered by the data link layer instead of being 
passed to the network layer. Eventually, the NAK 2 gets back to the sender, which immediately resends frame 2. 
When that arrives, the data link layer now has 2, 3, 4, and 5 and can pass all of them to the network layer in the 
correct order. It can also acknowledge all frames up to and including 5, as shown in the figure. If the NAK should 
get lost, eventually the sender will time out for frame 2 and send it (and only it) of its own accord, but that may be 
quite a while later. In effect, the NAK speeds up the retransmission of one specific frame. 


Selective repeat corresponds to a receiver window larger than 1. Any frame within the window may be accepted 
and buffered until all the preceding ones have been passed to the network layer. This approach can require 
large amounts of data link layer memory if the window is large. 
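The buffering behaviour of a selective repeat receiver can be sketched as below. This is an illustration under stated assumptions, not the book's protocol: the names are hypothetical, the sequence space of 8 and the receive window of 4 are chosen only for the example, and NAK generation is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SEQ_SPACE   8                  /* sequence numbers 0..7 (assumed) */
#define RECV_WINDOW (SEQ_SPACE / 2)    /* receive window size (assumed) */

typedef struct {
    unsigned frame_expected;           /* lower edge of the receive window */
    bool arrived[SEQ_SPACE];           /* true if that frame is buffered */
} sr_receiver;

/* True if seq lies within the current receive window, counted circularly. */
bool sr_in_window(const sr_receiver *rx, unsigned seq)
{
    unsigned offset = (seq + SEQ_SPACE - rx->frame_expected) % SEQ_SPACE;
    return offset < RECV_WINDOW;
}

/* Accept one undamaged frame. Returns how many packets could be passed to
   the network layer, in order, as a result of this arrival. */
unsigned sr_receive(sr_receiver *rx, unsigned seq)
{
    unsigned delivered = 0;

    assert(seq < SEQ_SPACE);
    if (!sr_in_window(rx, seq) || rx->arrived[seq])
        return 0;                      /* outside the window or a duplicate */
    rx->arrived[seq] = true;           /* buffer the frame */
    while (rx->arrived[rx->frame_expected]) {
        rx->arrived[rx->frame_expected] = false;   /* deliver and free the slot */
        rx->frame_expected = (rx->frame_expected + 1) % SEQ_SPACE;
        delivered++;
    }
    return delivered;
}
```

Feeding it the arrival order 0, 1, 3, 4, 5, 2 reproduces the scenario described for Fig. 3-16(b): frames 3, 4, and 5 are held back, and the late arrival of frame 2 releases all four packets at once.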


These two alternative approaches are trade-offs between bandwidth and data link layer buffer space. Depending 
on which resource is scarcer, one or the other can be used. Figure 3-17 shows a pipelining protocol in which the 
receiving data link layer only accepts frames in order; frames following an error are discarded. In this protocol, 
for the first time we have dropped the assumption that the network layer always has an infinite supply of packets 
to send. When the network layer has a packet it wants to send, it can cause a network_layer_ready event to 
happen. However, to enforce the flow control rule of no more than MAX_SEQ unacknowledged frames 
outstanding at any time, the data link layer must be able to keep the network layer from bothering it with more 
work. The library procedures enable_network_layer and disable_network_layer do this job. 


Figure 3-17. A sliding window protocol using go back n. 


/* Protocol 5 (go back n) allows multiple outstanding frames. The sender may transmit up 
   to MAX_SEQ frames without waiting for an ack. In addition, unlike in the previous 
   protocols, the network layer is not assumed to have a new packet all the time. Instead, 
   the network layer causes a network_layer_ready event when there is a packet to send. */ 

#define MAX_SEQ 7                         /* should be 2^n - 1 */ 
typedef enum {frame_arrival, cksum_err, timeout, network_layer_ready} event_type; 
#include "protocol.h" 

static boolean between(seq_nr a, seq_nr b, seq_nr c) 
{ 
/* Return true if a <= b < c circularly; false otherwise. */ 
  if (((a <= b) && (b < c)) || ((c < a) && (a <= b)) || ((b < c) && (c < a))) 
    return(true); 
  else 
    return(false); 
} 

static void send_data(seq_nr frame_nr, seq_nr frame_expected, packet buffer[]) 
{ 
/* Construct and send a data frame. */ 
  frame s;                                /* scratch variable */ 

  s.info = buffer[frame_nr];              /* insert packet into frame */ 
  s.seq = frame_nr;                       /* insert sequence number into frame */ 
  s.ack = (frame_expected + MAX_SEQ) % (MAX_SEQ + 1);   /* piggyback ack */ 
  to_physical_layer(&s);                  /* transmit the frame */ 
  start_timer(frame_nr);                  /* start the timer running */ 
} 

void protocol5(void) 
{ 
  seq_nr next_frame_to_send;              /* MAX_SEQ > 1; used for outbound stream */ 
  seq_nr ack_expected;                    /* oldest frame as yet unacknowledged */ 
  seq_nr frame_expected;                  /* next frame expected on inbound stream */ 
  frame r;                                /* scratch variable */ 
  packet buffer[MAX_SEQ + 1];             /* buffers for the outbound stream */ 
  seq_nr nbuffered;                       /* # output buffers currently in use */ 
  seq_nr i;                               /* used to index into the buffer array */ 
  event_type event; 

  enable_network_layer();                 /* allow network_layer_ready events */ 
  ack_expected = 0;                       /* next ack expected inbound */ 
  next_frame_to_send = 0;                 /* next frame going out */ 
  frame_expected = 0;                     /* number of frame expected inbound */ 
  nbuffered = 0;                          /* initially no packets are buffered */ 

  while (true) { 
    wait_for_event(&event);               /* four possibilities: see event_type above */ 

    switch(event) { 
      case network_layer_ready:           /* the network layer has a packet to send */ 
        /* Accept, save, and transmit a new frame. */ 
        from_network_layer(&buffer[next_frame_to_send]);   /* fetch new packet */ 
        nbuffered = nbuffered + 1;        /* expand the sender's window */ 
        send_data(next_frame_to_send, frame_expected, buffer);   /* transmit the frame */ 
        inc(next_frame_to_send);          /* advance sender's upper window edge */ 
        break; 

      case frame_arrival:                 /* a data or control frame has arrived */ 
        from_physical_layer(&r);          /* get incoming frame from physical layer */ 

        if (r.seq == frame_expected) { 
          /* Frames are accepted only in order. */ 
          to_network_layer(&r.info);      /* pass packet to network layer */ 
          inc(frame_expected);            /* advance lower edge of receiver's window */ 
        } 

        /* Ack n implies n - 1, n - 2, etc. Check for this. */ 
        while (between(ack_expected, r.ack, next_frame_to_send)) { 


Note that a maximum of MAX_SEQ frames and not MAX_SEQ + 1 frames may be outstanding at any instant, 
even though there are MAX_SEQ + 1 distinct sequence numbers: 0, 1, 2, ..., MAX_SEQ. To see why this 
restriction is required, consider the following scenario with MAX_SEQ = 7. 


1. The sender sends frames 0 through 7. 

2. A piggybacked acknowledgement for frame 7 eventually comes back to the sender. 

3. The sender sends another eight frames, again with sequence numbers 0 through 7. 

4. Now another piggybacked acknowledgement for frame 7 comes in. 


The question is this: Did all eight frames belonging to the second batch arrive successfully, or did all eight get 
lost (counting discards following an error as lost)? In both cases the receiver would be sending frame 7 as the 
acknowledgement. The sender has no way of telling. For this reason the maximum number of outstanding 
frames must be restricted to MAX_SEQ. 


Although protocol 5 does not buffer the frames arriving after an error, it does not escape the problem of buffering 
altogether. Since a sender may have to retransmit all the unacknowledged frames at a future time, it must hang 
on to all transmitted frames until it knows for sure that they have been accepted by the receiver. When an 
acknowledgement comes in for frame n, frames n - 1, n - 2, and so on are also automatically acknowledged. 
This property is especially important when some of the previous acknowledgement-bearing frames were lost or 
garbled. Whenever any acknowledgement comes in, the data link layer checks to see if any buffers can now be 
released. If buffers can be released (i.e., there is some room available in the window), a previously blocked 
network layer can now be allowed to cause more network_layer_ready events.
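
The cumulative-acknowledgement rule described above can be sketched as a small loop. This is only an illustration: the helper name release_acked and its parameter list are hypothetical, timers are left out, and between is written in the three-clause circular form used by the selective repeat listing in this chapter.

```c
#include <assert.h>

#define MAX_SEQ 7

/* Return 1 if a <= b < c circularly, 0 otherwise. */
int between(unsigned a, unsigned b, unsigned c)
{
    return ((a <= b) && (b < c)) || ((c < a) && (a <= b)) || ((b < c) && (c < a));
}

/* Release every buffer covered by the cumulative acknowledgement `ack`.
   Updates the lower window edge and the buffered-frame count, and
   returns how many frames were released (hypothetical helper). */
unsigned release_acked(unsigned *ack_expected, unsigned next_frame_to_send,
                       unsigned ack, unsigned *nbuffered)
{
    unsigned released = 0;
    assert(*ack_expected <= MAX_SEQ && next_frame_to_send <= MAX_SEQ && ack <= MAX_SEQ);
    while (between(*ack_expected, ack, next_frame_to_send)) {
        *nbuffered -= 1;                                      /* one frame fewer buffered */
        *ack_expected = (*ack_expected + 1) % (MAX_SEQ + 1);  /* contract sender's window */
        released++;
    }
    return released;
}
```

With frames 6, 7, 0, and 1 outstanding, an acknowledgement for frame 0 releases 6, 7, and 0 in one pass, while an acknowledgement that lies outside the window releases nothing.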


For this protocol, we assume that there is always reverse traffic on which to piggyback acknowledgements. If 
there is not, no acknowledgements can be sent. Protocol 4 does not need this assumption since it sends back 
one frame every time it receives a frame, even if it has already sent that frame. In the next protocol we will
solve the problem of one-way traffic in an elegant way. 


Because protocol 5 has multiple outstanding frames, it logically needs multiple timers, one per outstanding 
frame. Each frame times out independently of all the other ones. All of these timers can easily be simulated in 
software, using a single hardware clock that causes interrupts periodically. The pending timeouts form a linked 
list, with each node of the list telling the number of clock ticks until the timer expires, the frame being timed, and 
a pointer to the next node. 


As an illustration of how the timers could be implemented, consider the example of Fig. 3-18(a). Assume that the 
clock ticks once every 100 msec. Initially, the real time is 10:00:00.0; three timeouts are pending, at 10:00:00.5, 
10:00:01.3, and 10:00:01.9. Every time the hardware clock ticks, the real time is updated and the tick counter at 
the head of the list is decremented. When the tick counter becomes zero, a timeout is caused and the node is 
removed from the list, as shown in Fig. 3-18(b). Although this organization requires the list to be scanned when 
start_timer or stop_timer is called, it does not require much work per tick. In protocol 5, both of these routines
have been given a parameter, indicating which frame is to be timed. 


Figure 3-18. Simulation of multiple timers in software. 
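
The linked list of pending timeouts can be sketched as follows. This is a minimal illustration, not the simulator's code: the names timer_node, start_timer_ticks, clock_tick, and pop_expired are hypothetical, and each node is assumed to store its tick count relative to the node before it, so that only the head has to be decremented on a clock interrupt.

```c
#include <assert.h>
#include <stdlib.h>

/* One pending timeout. ticks_to_go is relative to the preceding node. */
typedef struct timer_node {
    unsigned ticks_to_go;        /* ticks after the previous node expires */
    unsigned frame;              /* frame being timed */
    struct timer_node *next;
} timer_node;

/* Insert a timeout `ticks` clock ticks from now; this scans the list. */
void start_timer_ticks(timer_node **head, unsigned frame, unsigned ticks)
{
    timer_node **pp = head;
    timer_node *n = malloc(sizeof *n);
    assert(n != NULL);
    /* Walk past every node that expires no later than the new one. */
    while (*pp != NULL && (*pp)->ticks_to_go <= ticks) {
        ticks -= (*pp)->ticks_to_go;
        pp = &(*pp)->next;
    }
    n->ticks_to_go = ticks;
    n->frame = frame;
    n->next = *pp;
    if (*pp != NULL) (*pp)->ticks_to_go -= ticks;  /* successor keeps its expiry time */
    *pp = n;
}

/* Work done on every clock interrupt: decrement the head only. */
void clock_tick(timer_node *head)
{
    if (head != NULL && head->ticks_to_go > 0) head->ticks_to_go--;
}

/* If the head has reached zero, remove it, store its frame number in
   *frame, and return 1. Otherwise return 0. */
int pop_expired(timer_node **head, unsigned *frame)
{
    timer_node *h = *head;
    if (h == NULL || h->ticks_to_go != 0) return 0;
    *frame = h->frame;
    *head = h->next;
    free(h);
    return 1;
}
```

With a 100-msec tick, timeouts due at 0.5, 1.3, and 1.9 seconds from now are stored as relative counts of 5, 8, and 6 ticks. Calling pop_expired in a loop after each tick also handles several timers that expire on the same tick.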

3.4.3 A Protocol Using Selective Repeat


Protocol 5 works well if errors are rare, but if the line is poor, it wastes a lot of bandwidth on retransmitted 
frames. An alternative strategy for handling errors is to allow the receiver to accept and buffer the frames 
following a damaged or lost one. Such a protocol does not discard frames merely because an earlier frame was 
damaged or lost. 


In this protocol, both sender and receiver maintain a window of acceptable sequence numbers. The sender's
window size starts out at 0 and grows to some predefined maximum, NR_BUFS, which is (MAX_SEQ + 1)/2 in
Fig. 3-19. The receiver's window, in contrast, is always fixed in size and equal to NR_BUFS. The receiver has a
buffer reserved for each sequence number within its fixed window. Associated with each buffer is a bit (arrived)
telling whether the buffer is full or empty. Whenever a frame arrives, its sequence number is checked by the
function between to see if it falls within the window. If so and if it has not already been received, it is accepted
and stored. This action is taken without regard to whether or not it contains the next packet expected by the
network layer. Of course, it must be kept within the data link layer and not passed to the network layer until all
the lower-numbered frames have already been delivered to the network layer in the correct order. A protocol
using this algorithm is given in Fig. 3-19.


Figure 3-19. A sliding window protocol using selective repeat. 


/* Protocol 6 (selective repeat) accepts frames out of order but passes packets to the
   network layer in order. Associated with each outstanding frame is a timer. When the timer
   expires, only that frame is retransmitted, not all the outstanding frames, as in protocol 5. */

#define MAX_SEQ 7 /* should be 2^n - 1 */
#define NR_BUFS ((MAX_SEQ + 1)/2)
typedef enum {frame_arrival, cksum_err, timeout, network_layer_ready, ack_timeout} event_type;
#include "protocol.h"
boolean no_nak = true; /* no nak has been sent yet */
seq_nr oldest_frame = MAX_SEQ + 1; /* initial value is only for the simulator */

static boolean between(seq_nr a, seq_nr b, seq_nr c)
{
  /* Same as between in protocol 5, but shorter and more obscure. */
  return ((a <= b) && (b < c)) || ((c < a) && (a <= b)) || ((b < c) && (c < a));
}

static void send_frame(frame_kind fk, seq_nr frame_nr, seq_nr frame_expected, packet buffer[])
{
  /* Construct and send a data, ack, or nak frame. */
  frame s; /* scratch variable */

  s.kind = fk; /* kind == data, ack, or nak */
  if (fk == data) s.info = buffer[frame_nr % NR_BUFS];
  s.seq = frame_nr; /* only meaningful for data frames */
  s.ack = (frame_expected + MAX_SEQ) % (MAX_SEQ + 1);
  if (fk == nak) no_nak = false; /* one nak per frame, please */
  to_physical_layer(&s); /* transmit the frame */
  if (fk == data) start_timer(frame_nr % NR_BUFS);
  stop_ack_timer(); /* no need for separate ack frame */
}

void protocol6(void)
{
  seq_nr ack_expected; /* lower edge of sender's window */
  seq_nr next_frame_to_send; /* upper edge of sender's window + 1 */
  seq_nr frame_expected; /* lower edge of receiver's window */
  seq_nr too_far; /* upper edge of receiver's window + 1 */
  int i; /* index into buffer pool */
  frame r; /* scratch variable */
  packet out_buf[NR_BUFS]; /* buffers for the outbound stream */
  packet in_buf[NR_BUFS]; /* buffers for the inbound stream */
  boolean arrived[NR_BUFS]; /* inbound bit map */
  seq_nr nbuffered; /* how many output buffers currently used */
  event_type event;

  enable_network_layer(); /* initialize */
  ack_expected = 0; /* next ack expected on the inbound stream */
  next_frame_to_send = 0; /* number of next outgoing frame */
  frame_expected = 0;
  too_far = NR_BUFS;
  nbuffered = 0; /* initially no packets are buffered */
  for (i = 0; i < NR_BUFS; i++) arrived[i] = false;

  while (true) {
    wait_for_event(&event); /* five possibilities: see event_type above */
    switch(event) {
      case network_layer_ready: /* accept, save, and transmit a new frame */
        nbuffered = nbuffered + 1; /* expand the window */
        from_network_layer(&out_buf[next_frame_to_send % NR_BUFS]); /* fetch new packet */
        send_frame(data, next_frame_to_send, frame_expected, out_buf); /* transmit the frame */
        inc(next_frame_to_send); /* advance upper window edge */
        break;

      case frame_arrival: /* a data or control frame has arrived */
        from_physical_layer(&r); /* fetch incoming frame from physical layer */
        if (r.kind == data) {
          /* An undamaged frame has arrived. */
          if ((r.seq != frame_expected) && no_nak)
            send_frame(nak, 0, frame_expected, out_buf); else start_ack_timer();
          if (between(frame_expected, r.seq, too_far) && (arrived[r.seq % NR_BUFS] == false)) {
            /* Frames may be accepted in any order. */
            arrived[r.seq % NR_BUFS] = true; /* mark buffer as full */
            in_buf[r.seq % NR_BUFS] = r.info; /* insert data into buffer */
            while (arrived[frame_expected % NR_BUFS]) {
              /* Pass frames and advance window. */
              to_network_layer(&in_buf[frame_expected % NR_BUFS]);
              no_nak = true;
              arrived[frame_expected % NR_BUFS] = false;
              inc(frame_expected); /* advance lower edge of receiver's window */

Nonsequential receive introduces certain problems not present in protocols in which frames are only accepted in
order. We can illustrate the trouble most easily with an example. Suppose that we have a 3-bit sequence 
number, so that the sender is permitted to transmit up to seven frames before being required to wait for an 
acknowledgement. Initially, the sender's and receiver's windows are as shown in Fig. 3-20(a). The sender now 
transmits frames 0 through 6. The receiver's window allows it to accept any frame with sequence number 
between 0 and 6 inclusive. All seven frames arrive correctly, so the receiver acknowledges them and advances 
its window to allow receipt of 7, 0, 1, 2, 3, 4, or 5, as shown in Fig. 3-20(b). All seven buffers are marked empty. 


Figure 3-20. (a) Initial situation with a window of size seven. (b) After seven frames have been sent and 
received but not acknowledged. (c) Initial situation with a window size of four. (d) After four frames have 
been sent and received but not acknowledged. 


        Sender's window     Receiver's window
(a)     0 through 6         0 through 6
(b)     0 through 6         7, 0 through 5
(c)     0 through 3         0 through 3
(d)     0 through 3         4 through 7


It is at this point that disaster strikes in the form of a lightning bolt hitting the telephone pole and wiping out all the 
acknowledgements. The sender eventually times out and retransmits frame 0. When this frame arrives at the 
receiver, a check is made to see if it falls within the receiver's window. Unfortunately, in Fig. 3-20(b) frame 0 is 
within the new window, so it will be accepted. The receiver sends a piggybacked acknowledgement for frame 6, 
since 0 through 6 have been received. 


The sender is happy to learn that all its transmitted frames did actually arrive correctly, so it advances its window 
and immediately sends frames 7, 0, 1, 2, 3, 4, and 5. Frame 7 will be accepted by the receiver and its packet will 
be passed directly to the network layer. Immediately thereafter, the receiving data link layer checks to see if it 
has a valid frame 0 already, discovers that it does, and passes the embedded packet to the network layer.
Consequently, the network layer gets an incorrect packet, and the protocol fails. 


The essence of the problem is that after the receiver advanced its window, the new range of valid sequence 
numbers overlapped the old one. Consequently, the following batch of frames might be either duplicates (if all 
the acknowledgements were lost) or new ones (if all the acknowledgements were received). The poor receiver 
has no way of distinguishing these two cases. 


The way out of this dilemma lies in making sure that after the receiver has advanced its window, there is no 
overlap with the original window. To ensure that there is no overlap, the maximum window size should be at 
most half the range of the sequence numbers, as is done in Fig. 3-20(c) and Fig. 3-20(d). For example, if 4 bits 
are used for sequence numbers, these will range from 0 to 15. Only eight unacknowledged frames should be 
outstanding at any instant. That way, if the receiver has just accepted frames 0 through 7 and advanced its 
window to permit acceptance of frames 8 through 15, it can unambiguously tell if subsequent frames are 
retransmissions (0 through 7) or new ones (8 through 15). In general, the window size for protocol 6 will be 
(MAX_SEQ + 1)/2. Thus, for 3-bit sequence numbers, the window size is four.
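
The no-overlap condition can be checked mechanically. The sketch below is illustrative only, and windows_overlap and max_window are hypothetical names. It takes the old window to be sequence numbers 0 through w - 1 and the advanced window to be the next w numbers, modulo the size of the sequence space.

```c
#include <assert.h>

/* Returns 1 if the window of `w` sequence numbers starting at 0 shares any
   number with the next window (starting at w), all modulo `space`. */
int windows_overlap(unsigned space, unsigned w)
{
    assert(space > 0 && w <= space);
    for (unsigned i = 0; i < w; i++) {
        unsigned next = (w + i) % space;   /* a number in the advanced window */
        if (next < w) return 1;            /* it was also in the old window 0..w-1 */
    }
    return 0;
}

/* Largest window with no overlap: half the sequence space. */
unsigned max_window(unsigned max_seq)
{
    return (max_seq + 1) / 2;
}
```

For a 3-bit sequence number (a space of eight), a window of seven overlaps its successor but a window of four does not, which matches the two cases in Fig. 3-20.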


An interesting question is: How many buffers must the receiver have? Under no conditions will it ever accept 
frames whose sequence numbers are below the lower edge of the window or frames whose sequence numbers 
are above the upper edge of the window. Consequently, the number of buffers needed is equal to the window 
size, not to the range of sequence numbers. In the above example of a 4-bit sequence number, eight buffers, 
numbered 0 through 7, are needed. When frame i arrives, it is put in buffer i mod 8. Notice that although frames
i and i + 8 are "competing" for the same buffer (both map to buffer i mod 8), they are never within the window at
the same time, because that would imply a window size of at least 9.


For the same reason, the number of timers needed is equal to the number of buffers, not to the size of the 
sequence space. Effectively, a timer is associated with each buffer. When the timer runs out, the contents of the 
buffer are retransmitted. 

In protocol 5, there is an implicit assumption that the channel is heavily loaded. When a frame arrives, no
acknowledgement is sent immediately. Instead, the acknowledgement is piggybacked onto the next outgoing
data frame. If the reverse traffic is light, the acknowledgement will be held up for a long period of time. If there is
a lot of traffic in one direction and no traffic in the other direction, only MAX_SEQ packets are sent, and then the
protocol blocks, which is why we had to assume there was always some reverse traffic.


In protocol 6 this problem is fixed. After an in-sequence data frame arrives, an auxiliary timer is started by
start_ack_timer. If no reverse traffic has presented itself before this timer expires, a separate acknowledgement
frame is sent. An interrupt due to the auxiliary timer is called an ack_timeout event. With this arrangement,
one-directional traffic flow is now possible because the lack of reverse data frames onto which acknowledgements
can be piggybacked is no longer an obstacle. Only one auxiliary timer exists, and if start_ack_timer is called
while the timer is running, it is reset to a full acknowledgement timeout interval.


It is essential that the timeout associated with the auxiliary timer be appreciably shorter than the timer used for 
timing out data frames. This condition is required to make sure a correctly received frame is acknowledged early 
enough that the frame's retransmission timer does not expire and retransmit the frame. 


Protocol 6 uses a more efficient strategy than protocol 5 for dealing with errors. Whenever the receiver has 
reason to suspect that an error has occurred, it sends a negative acknowledgement (NAK) frame back to the 
sender. Such a frame is a request for retransmission of the frame specified in the NAK. There are two cases 
when the receiver should be suspicious: a damaged frame has arrived or a frame other than the expected one 
arrived (potential lost frame). To avoid making multiple requests for retransmission of the same lost frame, the 
receiver should keep track of whether a NAK has already been sent for a given frame. The variable no_nak in
protocol 6 is true if no NAK has been sent yet for frame_expected. If the NAK gets mangled or lost, no real harm
is done, since the sender will eventually time out and retransmit the missing frame anyway. If the wrong frame
arrives after a NAK has been sent and lost, no_nak will be false, so no second NAK is sent and the auxiliary
timer is started instead. When it expires, an ACK will be sent to resynchronize the sender to the receiver's
current status.


In some situations, the time required for a frame to propagate to the destination, be processed there, and have 
the acknowledgement come back is (nearly) constant. In these situations, the sender can adjust its timer to be 
just slightly larger than the normal time interval expected between sending a frame and receiving its 
acknowledgement. However, if this time is highly variable, the sender is faced with the choice of either setting 
the interval to a small value (and risking unnecessary retransmissions), or setting it to a large value (and going 
idle for a long period after an error). 


Both choices waste bandwidth. If the reverse traffic is sporadic, the time before acknowledgement will be 
irregular, being shorter when there is reverse traffic and longer when there is not. Variable processing time within 
the receiver can also be a problem here. In general, whenever the standard deviation of the acknowledgement 
interval is small compared to the interval itself, the timer can be set "tight" and NAKs are not useful. Otherwise 
the timer must be set "loose," to avoid unnecessary retransmissions, but NAKs can appreciably speed up 
retransmission of lost or damaged frames. 


Closely related to the matter of timeouts and NAKs is the question of determining which frame caused a timeout. 
In protocol 5, it is always ack_expected, because it is always the oldest. In protocol 6, there is no trivial way to
determine who timed out. Suppose that frames 0 through 4 have been transmitted, meaning that the list of
outstanding frames is 01234, in order from oldest to youngest. Now imagine that 0 times out, 5 (a new frame) is 
transmitted, 1 times out, 2 times out, and 6 (another new frame) is transmitted. At this point the list of 
outstanding frames is 3405126, from oldest to youngest. If all inbound traffic (i.e., acknowledgement-bearing 
frames) is lost for a while, the seven outstanding frames will time out in that order. 
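
The ordering in this example can be reproduced with a short sketch. The timer_order structure and the transmit and time_out helpers are hypothetical names introduced for illustration. The assumption is that a timeout retransmits the frame and restarts its timer, making it the youngest entry.

```c
#include <assert.h>
#include <string.h>

#define MAX_OUT 16

/* Outstanding frames ordered by when their timer was last started,
   oldest first. */
typedef struct {
    int frame[MAX_OUT];
    int count;
} timer_order;

/* A newly transmitted frame gets the youngest timer. */
void transmit(timer_order *t, int frame)
{
    assert(t->count < MAX_OUT);
    t->frame[t->count++] = frame;
}

/* A timeout retransmits the frame and restarts its timer, so the frame
   moves from its current position to the youngest position. */
void time_out(timer_order *t, int frame)
{
    int i = 0;
    while (i < t->count && t->frame[i] != frame) i++;
    assert(i < t->count);                       /* frame must be outstanding */
    memmove(&t->frame[i], &t->frame[i + 1],
            (size_t)(t->count - i - 1) * sizeof t->frame[0]);
    t->frame[t->count - 1] = frame;
}
```

Starting from 01234 and applying the events in the text (0 times out, 5 is sent, 1 and 2 time out, 6 is sent) leaves the list as 3405126.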


To keep the example from getting even more complicated than it already is, we have not shown the timer 
administration. Instead, we just assume that the variable oldest_frame is set upon timeout to indicate which
frame timed out. 


3.5 Protocol Verification 


Realistic protocols and the programs that implement them are often quite complicated. Consequently, much
research has been done trying to find formal, mathematical techniques for specifying and verifying protocols. In
the following sections we will look at some models and techniques. Although we are looking at them in the
context of the data link layer, they are also applicable to other layers.


3.5.1 Finite State Machine Models


A key concept used in many protocol models is the finite state machine. With this technique, each protocol 
machine (i.e., sender or receiver) is always in a specific state at every instant of time. Its state consists of all the 
values of its variables, including the program counter. 


In most cases, a large number of states can be grouped for purposes of analysis. For example, considering the 
receiver in protocol 3, we could abstract out from all the possible states two important ones: waiting for frame 0 
or waiting for frame 1. All other states can be thought of as transient, just steps on the way to one of the main 
states. Typically, the states are chosen to be those instants that the protocol machine is waiting for the next 
event to happen [i.e., executing the procedure call wait(event) in our examples]. At this point the state of the
protocol machine is completely determined by the states of its variables. The number of states is then 2^n, where
n is the number of bits needed to represent all the variables combined.


The state of the complete system is the combination of all the states of the two protocol machines and the 
channel. The state of the channel is determined by its contents. Using protocol 3 again as an example, the 
channel has four possible states: a 0 frame or a 1 frame moving from sender to receiver, an acknowledgement 
frame going the other way, or an empty channel. If we model the sender and receiver as each having two states, 
the complete system has 16 distinct states. 


A word about the channel state is in order. The concept of a frame being "on the channel" is an abstraction, of 
course. What we really mean is that a frame has possibly been received, but not yet processed at the 
destination. A frame remains "on the channel" until the protocol machine executes from_physical_layer and
processes it. 


From each state, there are zero or more possible transitions to other states. Transitions occur when some event 
happens. For a protocol machine, a transition might occur when a frame is sent, when a frame arrives, when a 
timer expires, when an interrupt occurs, etc. For the channel, typical events are insertion of a new frame onto the 
channel by a protocol machine, delivery of a frame to a protocol machine, or loss of a frame due to noise. Given 
a complete description of the protocol machines and the channel characteristics, it is possible to draw a directed 
graph showing all the states as nodes and all the transitions as directed arcs. 


One particular state is designated as the initial state. This state corresponds to the description of the system 
when it starts running, or at some convenient starting place shortly thereafter. From the initial state, some, 
perhaps all, of the other states can be reached by a sequence of transitions. Using well-known techniques from 
graph theory (e.g., computing the transitive closure of a graph), it is possible to determine which states are 
reachable and which are not. This technique is called reachability analysis (Lin et al., 1987). This analysis can 
be helpful in determining whether a protocol is correct. 


Formally, a finite state machine model of a protocol can be regarded as a quadruple (S, M, I, T), where:

S is the set of states the processes and channel can be in.

M is the set of frames that can be exchanged over the channel.

I is the set of initial states of the processes.

T is the set of transitions between states.


At the beginning of time, all processes are in their initial states. Then events begin to happen, such as frames 
becoming available for transmission or timers going off. Each event may cause one of the processes or the 
channel to take an action and switch to a new state. By carefully enumerating each possible successor to each 
state, one can build the reachability graph and analyze the protocol. 


Reachability analysis can be used to detect a variety of errors in the protocol specification. For example, if it is
possible for a certain frame to occur in a certain state and the finite state machine does not say what action
should be taken, the specification is in error (incompleteness). If there exists a set of states from which no exit
can be made and from which no progress can be made (i.e., no correct frames can be received any more), we
have another error (deadlock). A less serious error is a protocol specification that tells how to handle an event in
a state in which the event cannot occur (extraneous transition). Other errors can also be detected.


As an example of a finite state machine model, consider Fig. 3-21(a). This graph corresponds to protocol 3 as 
described above: each protocol machine has two states and the channel has four states. A total of 16 states 
exist, not all of them reachable from the initial one. The unreachable ones are not shown in the figure. 
Checksum errors are also ignored here for simplicity. 


Figure 3-21. (a) State diagram for protocol 3. (b) Transitions. 


Transition   Who runs?   Frame accepted   Frame emitted   To network layer
0            -           (frame lost)     -               -
1            R           0                A               Yes
2            S           A                1               -
3            R           1                A               Yes
4            S           A                0               -
5            R           0                A               No
6            R           1                A               No
7            S           (timeout)        0               -
8            S           (timeout)        1               -
(b)


Each state is labeled by three characters, SRC, where S is 0 or 1, corresponding to the frame the sender is 
trying to send; R is also 0 or 1, corresponding to the frame the receiver expects, and C is 0, 1, A, or empty (-), 
corresponding to the state of the channel. In this example the initial state has been chosen as (000). In other 
words, the sender has just sent frame 0, the receiver expects frame 0, and frame 0 is currently on the channel. 


Nine kinds of transitions are shown in Fig. 3-21. Transition 0 consists of the channel losing its contents.
Transition 1 consists of the channel correctly delivering packet 0 to the receiver, with the receiver then changing 
its state to expect frame 1 and emitting an acknowledgement. Transition 1 also corresponds to the receiver 
delivering packet 0 to the network layer. The other transitions are listed in Fig. 3-21(b). The arrival of a frame 
with a checksum error has not been shown because it does not change the state (in protocol 3). 


During normal operation, transitions 1, 2, 3, and 4 are repeated in order over and over. In each cycle, two 
packets are delivered, bringing the sender back to the initial state of trying to send a new frame with sequence 
number 0. If the channel loses frame 0, it makes a transition from state (000) to state (00-). Eventually, the
sender times out (transition 7) and the system moves back to (000). The loss of an acknowledgement is more 
complicated, requiring two transitions, 7 and 5, or 8 and 6, to repair the damage. 


One of the properties that a protocol with a 1-bit sequence number must have is that no matter what sequence 
of events happens, the receiver never delivers two odd packets without an intervening even packet, and vice 
versa. From the graph of Fig. 3-21 we see that this requirement can be stated more formally as "there must not 
exist any paths from the initial state on which two occurrences of transition 1 occur without an occurrence of 
transition 3 between them, or vice versa." From the figure it can be seen that the protocol is correct in this 
respect. 


A similar requirement is that there not exist any paths on which the sender changes state twice (e.g., from 0 to 1 
and back to 0) while the receiver state remains constant. Were such a path to exist, then in the corresponding 
sequence of events, two frames would be irretrievably lost without the receiver noticing. The packet sequence 
delivered would have an undetected gap of two packets in it. 


Yet another important property of a protocol is the absence of deadlocks. A deadlock is a situation in which the 
protocol can make no more forward progress (i.e., deliver packets to the network layer) no matter what 
sequence of events happens. In terms of the graph model, a deadlock is characterized by the existence of a 
subset of states that is reachable from the initial state and that has two properties: 
1. There is no transition out of the subset.
2. There are no transitions in the subset that cause forward progress. 


Once in the deadlock situation, the protocol remains there forever. Again, it is easy to see from the graph that 
protocol 3 does not suffer from deadlocks. 
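
Reachability analysis of this model is small enough to sketch directly. The code below is an illustration under stated assumptions, not a definitive encoding. The preconditions in step are read off Fig. 3-21(b), a timeout is assumed to be possible only when the channel is empty, and the names state, step, reachable, and can_progress are hypothetical.

```c
#include <assert.h>

enum { CH_0, CH_1, CH_A, CH_EMPTY };          /* channel contents */

typedef struct { int s, r, c; } state;         /* sender, receiver, channel */

int index_of(state x) { return (x.s * 2 + x.r) * 4 + x.c; }

/* Try transition t (0..8) from state x. Returns 1 and writes the successor
   to *out if t is possible in x, otherwise returns 0. */
int step(state x, int t, state *out)
{
    state y = x;
    switch (t) {
    case 0: if (x.c == CH_EMPTY) return 0; y.c = CH_EMPTY; break;          /* frame lost */
    case 1: if (x.c != CH_0 || x.r != 0) return 0; y.r = 1; y.c = CH_A; break;
    case 2: if (x.c != CH_A || x.s != 0) return 0; y.s = 1; y.c = CH_1; break;
    case 3: if (x.c != CH_1 || x.r != 1) return 0; y.r = 0; y.c = CH_A; break;
    case 4: if (x.c != CH_A || x.s != 1) return 0; y.s = 0; y.c = CH_0; break;
    case 5: if (x.c != CH_0 || x.r != 1) return 0; y.c = CH_A; break;       /* duplicate 0 */
    case 6: if (x.c != CH_1 || x.r != 0) return 0; y.c = CH_A; break;       /* duplicate 1 */
    case 7: if (x.c != CH_EMPTY || x.s != 0) return 0; y.c = CH_0; break;   /* timeout */
    case 8: if (x.c != CH_EMPTY || x.s != 1) return 0; y.c = CH_1; break;   /* timeout */
    default: return 0;
    }
    *out = y;
    return 1;
}

/* Mark every state reachable from `start`; returns how many there are. */
int reachable(state start, int seen[16])
{
    state stack[16];
    int top = 0, count = 0;
    for (int i = 0; i < 16; i++) seen[i] = 0;
    seen[index_of(start)] = 1;
    stack[top++] = start;
    while (top > 0) {
        state x = stack[--top], y;
        count++;
        for (int t = 0; t <= 8; t++)
            if (step(x, t, &y) && !seen[index_of(y)]) {
                seen[index_of(y)] = 1;       /* each state is pushed at most once */
                stack[top++] = y;
            }
    }
    return count;
}

/* Returns 1 if some state reachable from x can fire transition 1 or 3,
   i.e., can still deliver a packet to the network layer. */
int can_progress(state x)
{
    int seen[16];
    state y;
    reachable(x, seen);
    for (int i = 0; i < 16; i++) {
        state z = { i / 8, (i / 4) % 2, i % 4 };
        if (seen[i] && (step(z, 1, &y) || step(z, 3, &y))) return 1;
    }
    return 0;
}
```

Under these assumptions, 10 of the 16 states are reachable from (000), and can_progress holds for every one of them, so no reachable subset satisfies both deadlock conditions.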


3.5.2 Petri Net Models 


The finite state machine is not the only technique for formally specifying protocols. In this section we will 
describe a completely different technique, the Petri net (Danthine, 1980). A Petri net has four basic elements: 
places, transitions, arcs, and tokens. A place represents a state which (part of) the system may be in. Figure 3-22
shows a Petri net with two places, A and B, both shown as circles. The system is currently in state A,
indicated by the token (heavy dot) in place A. A transition is indicated by a horizontal or vertical bar. Each 
transition has zero or more input arcs coming from its input places, and zero or more output arcs, going to its 
output places. 


Figure 3-22. A Petri net with two places and two transitions. 


A transition is enabled if there is at least one input token in each of its input places. Any enabled transition may 
fire at will, removing one token from each input place and depositing a token in each output place. If the number 
of input arcs and output arcs differs, tokens will not be conserved. If two or more transitions are enabled, any 
one of them may fire. The choice of a transition to fire is indeterminate, which is why Petri nets are useful for 
modeling protocols. The Petri net of Fig. 3-22 is deterministic and can be used to model any two-phase process 
(e.g., the behavior of a baby: eat, sleep, eat, sleep, and so on). As with all modeling tools, unnecessary detail is 
suppressed. 
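
The enabling and firing rules can be sketched in a few lines. This is an illustration only: the transition structure and the enabled and fire functions are hypothetical names, and the net is assumed to be the two-place one described above, with one transition from A to B and one from B back to A.

```c
#include <assert.h>

#define PLACES 2          /* place 0 = A, place 1 = B */

typedef struct {
    int in[PLACES];       /* in[p] = 1 if p is an input place */
    int out[PLACES];      /* out[p] = 1 if p is an output place */
} transition;

/* Enabled: at least one token in each input place. */
int enabled(const int marking[PLACES], const transition *t)
{
    for (int p = 0; p < PLACES; p++)
        if (t->in[p] && marking[p] < 1) return 0;
    return 1;
}

/* Fire: remove one token from each input place and deposit one token in
   each output place. */
void fire(int marking[PLACES], const transition *t)
{
    assert(enabled(marking, t));
    for (int p = 0; p < PLACES; p++) {
        marking[p] -= t->in[p];
        marking[p] += t->out[p];
    }
}
```

With the token in A, only the A-to-B transition is enabled. Firing it moves the token to B and enables the other transition, giving the two-phase behavior.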


Figure 3-23 gives the Petri net model of Fig. 3-12. Unlike the finite state machine model, there are no composite 
states here; the sender's state, channel state, and receiver's state are represented separately. Transitions 1 and 
2 correspond to transmission of frame 0 by the sender, normally, and on a timeout respectively. Transitions 3 
and 4 are analogous for frame 1. Transitions 5, 6, and 7 correspond to the loss of frame 0, an 
acknowledgement, and frame 1, respectively. Transitions 8 and 9 occur when a data frame with the wrong 
sequence number arrives at the receiver. Transitions 10 and 11 represent the arrival at the receiver of the next 
frame in sequence and its delivery to the network layer. 

Figure 3-23. A Petri net model for protocol 3.

C: Seq 0 on the line
D: Ack on the line
E: Seq 1 on the line

Sender's state: Emit 0, Wait for Ack 0, Emit 1, Wait for Ack 1, Timeout
Channel state: C, D, E, Loss
Receiver's state: Process 0, Expect 1, Process 1, Expect 0, Reject 1

Petri nets can be used to detect protocol failures in a way similar to the use of finite state machines. For 
example, if some firing sequence included transition 10 twice without transition 11 intervening, the protocol 
would be incorrect. The concept of a deadlock in a Petri net is similar to its finite state machine counterpart. 


Petri nets can be represented in convenient algebraic form resembling a grammar. Each transition contributes 
one rule to the grammar. Each rule specifies the input and output places of the transition. Since Fig. 3-23 has 11 
transitions, its grammar has 11 rules, numbered 1 through 11, each one corresponding to the transition with the same
number. The grammar for the Petri net of Fig. 3-23 is as follows: 


(Grammar for Fig. 3-23: 11 rules, one per transition; for example, rule 3 is AD → BE.)


It is interesting to note how we have managed to reduce a complex protocol to 11 simple grammar rules that can 
easily be manipulated by a computer program. 

The current state of the Petri net is represented as an unordered collection of places, each place represented in
the collection as many times as it has tokens. Any rule, all of whose left-hand side places are present, can be
fired, removing those places from the current state, and adding its output places to the current state. The
marking of Fig. 3-23 is ACG (i.e., A, C, and G each have one token). Consequently, rules 2, 5, and 10 are all
enabled and any of them can be applied, leading to a new state (possibly with the same marking as the original
one). In contrast, rule 3 (AD → BE) cannot be applied because D is not marked.
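
Applying a grammar rule to a marking can be sketched as follows. The function names marking_from, rule_enabled, and apply_rule are hypothetical, and the net is assumed to have seven places, A through G. Only rule 3 (AD → BE) is taken from the text, and the marking ADG used to show it firing is an invented example.

```c
#include <assert.h>
#include <string.h>

#define NPLACES 7                         /* places A through G (assumed) */

/* Count tokens per place from a string such as "ACG". */
void marking_from(const char *places, int m[NPLACES])
{
    memset(m, 0, NPLACES * sizeof m[0]);
    for (; *places; places++) {
        assert(*places >= 'A' && *places <= 'G');
        m[*places - 'A']++;
    }
}

/* A rule can fire if the marking holds every place on its left-hand side,
   counted with multiplicity. */
int rule_enabled(const int m[NPLACES], const char *lhs)
{
    int need[NPLACES];
    marking_from(lhs, need);
    for (int p = 0; p < NPLACES; p++)
        if (m[p] < need[p]) return 0;
    return 1;
}

/* Apply lhs -> rhs: remove the left-hand places, add the right-hand ones. */
void apply_rule(int m[NPLACES], const char *lhs, const char *rhs)
{
    int take[NPLACES], give[NPLACES];
    assert(rule_enabled(m, lhs));
    marking_from(lhs, take);
    marking_from(rhs, give);
    for (int p = 0; p < NPLACES; p++) m[p] += give[p] - take[p];
}
```

With the marking ACG, rule_enabled reports that rule 3 cannot fire because D holds no token. With the hypothetical marking ADG it can, and applying it yields BEG.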


3.6 Example Data Link Protocols 


In the following sections we will examine several widely used data link protocols. The first one, HDLC, is a
classical bit-oriented protocol whose variants have been in use for decades in many applications. The second 
one, PPP, is the data link protocol used to connect home computers to the Internet. 


3.6.1 HDLC—High-Level Data Link Control 


In this section we will examine a group of closely related protocols that are a bit old but are still heavily used. 
They are all derived from the data link protocol first used in the IBM mainframe world: SDLC (Synchronous Data 
Link Control) protocol. After developing SDLC, IBM submitted it to ANSI and ISO for acceptance as U.S. and 
international standards, respectively. ANSI modified it to become ADCCP (Advanced Data Communication 
Control Procedure), and ISO modified it to become HDLC (High-level Data Link Control). CCITT then adopted 
and modified HDLC for its LAP (Link Access Procedure) as part of the X.25 network interface standard but later 
modified it again to LAPB, to make it more compatible with a later version of HDLC. The nice thing about 
standards is that you have so many to choose from. Furthermore, if you do not like any of them, you can just 
wait for next year's model. 


These protocols are based on the same principles. All are bit oriented, and all use bit stuffing for data 
transparency. They differ only in minor, but nevertheless irritating, ways. The discussion of bit-oriented protocols 
that follows is intended as a general introduction. For the specific details of any one protocol, please consult the 
appropriate definition. 
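
Bit stuffing can be sketched as follows. This is an illustration rather than any particular protocol's implementation: bits are stored one per array element for clarity, and bit_stuff and bit_unstuff are hypothetical names. The sender inserts a 0 after every run of five consecutive 1s, so the flag pattern 01111110 can never appear inside a frame.

```c
#include <assert.h>
#include <stddef.h>

/* Bit-stuff `n` input bits (one bit per element, values 0 or 1).
   Returns the number of output bits; `out` needs room for n + n/5. */
size_t bit_stuff(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t k = 0;
    int ones = 0;
    for (size_t i = 0; i < n; i++) {
        assert(in[i] <= 1);
        out[k++] = in[i];
        ones = in[i] ? ones + 1 : 0;
        if (ones == 5) {          /* five 1s in a row: stuff a 0 */
            out[k++] = 0;
            ones = 0;
        }
    }
    return k;
}

/* Inverse operation: drop the bit that follows every run of five 1s. */
size_t bit_unstuff(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t k = 0;
    int ones = 0;
    for (size_t i = 0; i < n; i++) {
        if (ones == 5) {          /* this bit is the stuffed 0: skip it */
            ones = 0;
            continue;
        }
        out[k++] = in[i];
        ones = in[i] ? ones + 1 : 0;
    }
    return k;
}
```

If the data happen to contain the flag pattern itself, the eight bits 01111110 go out as the nine bits 011111010, and the receiver recovers the original eight.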


All the bit-oriented protocols use the frame structure shown in Fig. 3-24. The Address field is primarily of 
importance on lines with multiple terminals, where it is used to identify one of the terminals. For point-to-point 
lines, it is sometimes used to distinguish commands from responses. 

Figure 3-24. Frame format for bit-oriented protocols. 


Bits       8          8         8       ≥ 0       16          8 

       01111110 | Address | Control |  Data  | Checksum | 01111110 


The Control field is used for sequence numbers, acknowledgements, and other purposes, as discussed below. 


The Data field may contain any information. It may be arbitrarily long, although the efficiency of the checksum 
falls off with increasing frame length due to the greater probability of multiple burst errors. 


The Checksum field is a cyclic redundancy code using the technique we examined in Sec. 3.2.2. 


The frame is delimited with another flag sequence (01111110). On idle point-to-point lines, flag sequences are 
transmitted continuously. The minimum frame contains three fields and totals 32 bits, excluding the flags on 
either end. 


There are three kinds of frames: Information, Supervisory, and Unnumbered. The contents of the Control field for 
these three kinds are shown in Fig. 3-25. The protocol uses a sliding window, with a 3-bit sequence number. Up 
to seven unacknowledged frames may be outstanding at any instant. The Seq field in Fig. 3-25(a) is the frame 
sequence number. The Next field is a piggybacked acknowledgement. However, all the protocols adhere to the 
convention that instead of piggybacking the number of the last frame received correctly, they use the number of 
the first frame not yet received (i.e., the next frame expected). The choice of using the last frame received or the 
next frame expected is arbitrary; it does not matter which convention is used, provided that it is used 
consistently. 


Figure 3-25. Control field of (a) an information frame, (b) a supervisory frame, (c) an unnumbered frame. 


Bits    1      3       1       3 
(a)   | 0 |   Seq    | P/F |   Next   | 
(b)   | 1 | 0 | Type | P/F |   Next   | 
(c)   | 1 | 1 | Type | P/F | Modifier | 


The P/F bit stands for Poll/Final. It is used when a computer (or concentrator) is polling a group of terminals. 
When used as P, the computer is inviting the terminal to send data. All the frames sent by the terminal, except 
the final one, have the P/F bit set to P. The final one is set to F. 


In some of the protocols, the P/F bit is used to force the other machine to send a Supervisory frame immediately 
rather than waiting for reverse traffic onto which to piggyback the window information. The bit also has some 
minor uses in connection with the Unnumbered frames. 


The various kinds of Supervisory frames are distinguished by the Type field. Type 0 is an acknowledgement 
frame (officially called RECEIVE READY) used to indicate the next frame expected. This frame is used when 
there is no reverse traffic to use for piggybacking. 


Type 1 is a negative acknowledgement frame (officially called REJECT). It is used to indicate that a transmission 
error has been detected. The Next field indicates the first frame in sequence not received correctly (i.e., the 
frame to be retransmitted). The sender is required to retransmit all outstanding frames starting at Next. This 
strategy is similar to our protocol 5 rather than our protocol 6. 


Type 2 is RECEIVE NOT READY. It acknowledges all frames up to but not including Next, just as RECEIVE 
READY does, but it tells the sender to stop sending. RECEIVE NOT READY is intended to signal certain 
temporary problems with the receiver, such as a shortage of buffers, and not as an alternative to the sliding 
window flow control. When the condition has been repaired, the receiver sends a RECEIVE READY, REJECT, 
or certain control frames. 


Type 3 is the SELECTIVE REJECT. It calls for retransmission of only the frame specified. In this sense it is like 
our protocol 6 rather than 5 and is therefore most useful when the sender's window size is half the sequence 
space size, or less. Thus, if a receiver wishes to buffer out-of-sequence frames for potential future use, it can 
force the retransmission of any specific frame using Selective Reject. HDLC and ADCCP allow this frame type, 
but SDLC and LAPB do not allow it (i.e., there is no Selective Reject), and type 3 frames are undefined. 


The third class of frame is the Unnumbered frame. It is sometimes used for control purposes but can also carry 
data when unreliable connectionless service is called for. The various bit-oriented protocols differ considerably 
here, in contrast with the other two kinds, where they are nearly identical. Five bits are available to indicate the 
frame type, but not all 32 possibilities are used. 
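The layout of Fig. 3-25 can be turned into a small decoder. The sketch below is only an illustration under stated assumptions: it takes the eight Control bits as a string in the left-to-right order of the figure and reads each multi-bit subfield as a binary number with its leftmost bit most significant. Real implementations must follow the bit ordering laid down in the relevant standard. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Supervisory frame types as described in the text.
SUPERVISORY_NAMES = {
    0: "RECEIVE READY",
    1: "REJECT",
    2: "RECEIVE NOT READY",
    3: "SELECTIVE REJECT",
}


@dataclass
class ControlField:
    kind: str                           # "Information", "Supervisory" or "Unnumbered"
    poll_final: int                     # the P/F bit
    seq: Optional[int] = None           # Seq (information frames only)
    next_expected: Optional[int] = None # Next (information and supervisory frames)
    frame_type: Optional[int] = None    # Type (supervisory and unnumbered frames)
    modifier: Optional[int] = None      # Modifier (unnumbered frames only)


def decode_control(bits: str) -> ControlField:
    """Decode an 8-bit Control field given in the order drawn in Fig. 3-25."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of eight 0/1 characters")
    poll_final = int(bits[4])
    last3 = int(bits[5:8], 2)
    if bits[0] == "0":
        # (a) 0 | Seq | P/F | Next
        return ControlField("Information", poll_final,
                            seq=int(bits[1:4], 2), next_expected=last3)
    frame_type = int(bits[2:4], 2)
    if bits[1] == "0":
        # (b) 1 0 | Type | P/F | Next
        return ControlField("Supervisory", poll_final,
                            next_expected=last3, frame_type=frame_type)
    # (c) 1 1 | Type | P/F | Modifier
    return ControlField("Unnumbered", poll_final,
                        frame_type=frame_type, modifier=last3)
```

Under these assumptions, `decode_control("10010101")` describes a Supervisory frame of type 1 (REJECT) with Next = 5, and `SUPERVISORY_NAMES` maps the type number to its official name. The five bits that identify an Unnumbered frame are the 2-bit Type together with the 3-bit Modifier.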


All the protocols provide a command, DISC (DISConnect), that allows a machine to announce that it is going 
down (e.g., for preventive maintenance). They also have a command that allows a machine that has just come 
back on-line to announce its presence and force all the sequence numbers back to zero. This command is called 
SNRM (Set Normal Response Mode). Unfortunately, "Normal Response Mode" is anything but normal. It is an 
unbalanced (i.e., asymmetric) mode in which one end of the line is the master and the other the slave. SNRM 
dates from a time when data communication meant a dumb terminal talking to a big host computer, which clearly 
is asymmetric. To make the protocol more suitable when the two partners are equals, HDLC and LAPB have an 
additional command, SABM (Set Asynchronous Balanced Mode), which resets the line and declares both parties 
to be equals. They also have commands SABME and SNRME, which are the same as SABM and SNRM, 
respectively, except that they enable an extended frame format that uses 7-bit sequence numbers instead of 3- 
bit sequence numbers. 


A third command provided by all the protocols is FRMR (FRaMe Reject), used to indicate that a frame with a 
correct checksum but impossible semantics arrived. Examples of impossible semantics are a type 3 Supervisory 
frame in LAPB, a frame shorter than 32 bits, an illegal control frame, and an acknowledgement of a frame that 
was outside the window, etc. FRMR frames contain a 24-bit data field telling what was wrong with the frame. The 
data include the control field of the bad frame, the window parameters, and a collection of bits used to signal 
specific errors. 


Control frames can be lost or damaged, just like data frames, so they must be acknowledged too. A special 
control frame, called UA (Unnumbered Acknowledgement), is provided for this purpose. Since only one control 
frame may be outstanding, there is never any ambiguity about which control frame is being acknowledged. 


The remaining control frames deal with initialization, polling, and status reporting. There is also a control frame 
that may contain arbitrary information, UI (Unnumbered Information). These data are not passed to the network 
layer but are for the receiving data link layer itself. 


Despite its widespread use, HDLC is far from perfect. A discussion of a variety of problems associated with it 
can be found in (Fiorini et al., 1994). 


3.6.2 The Data Link Layer in the Internet 


The Internet consists of individual machines (hosts and routers) and the communication infrastructure that 
connects them. Within a single building, LANs are widely used for interconnection, but most of the wide area 
infrastructure is built up from point-to-point leased lines. In Chap. 4, we will look at LANs; here we will examine 
the data link protocols used on point-to-point lines in the Internet. 


In practice, point-to-point communication is primarily used in two situations. First, thousands of organizations 
have one or more LANs, each with some number of hosts (personal computers, user workstations, servers, and 
so on) along with a router (or a bridge, which is functionally similar). Often, the routers are interconnected by a 
backbone LAN. Typically, all connections to the outside world go through one or two routers that have point-to- 
point leased lines to distant routers. It is these routers and their leased lines that make up the communication 
subnets on which the Internet is built. 


The second situation in which point-to-point lines play a major role in the Internet is the millions of individuals 
who have home connections to the Internet using modems and dial-up telephone lines. Usually, what happens is 
that the user's home PC calls up an Internet service provider's router and then acts like a full-blown Internet host. 
This method of operation is no different from having a leased line between the PC and the router, except that the 
connection is terminated when the user ends the session. A home PC calling an Internet service provider is 
illustrated in Fig. 3-26. The modem is shown external to the computer to emphasize its role, but modern 
computers have internal modems. 


Figure 3-26. A home personal computer acting as an Internet host. 

For both the router-router leased line connection and the dial-up host-router connection, some point-to-point 
data link protocol is required on the line for framing, error control, and the other data link layer functions we have 
studied in this chapter. The one used in the Internet is called PPP. We will now examine it. 


PPP—The Point-to-Point Protocol 


The Internet needs a point-to-point protocol for a variety of purposes, including router-to-router traffic and home 
user-to-ISP traffic. This protocol is PPP (Point-to-Point Protocol), which is defined in RFC 1661 and further 
elaborated on in several other RFCs (e.g., RFCs 1662 and 1663). PPP handles error detection, supports 
multiple protocols, allows IP addresses to be negotiated at connection time, permits authentication, and has 
many other features. 


PPP provides three features: 


1. A framing method that unambiguously delineates the end of one frame and the start of the next one. The 
frame format also handles error detection. 

2. A link control protocol for bringing lines up, testing them, negotiating options, and bringing them down 
again gracefully when they are no longer needed. This protocol is called LCP (Link Control Protocol). It 
supports synchronous and asynchronous circuits and byte-oriented and bit-oriented encodings. 

3. A way to negotiate network-layer options in a way that is independent of the network layer protocol to be 
used. The method chosen is to have a different NCP (Network Control Protocol) for each network layer 
supported. 


To see how these pieces fit together, let us consider the typical scenario of a home user calling up an Internet 
service provider to make a home PC a temporary Internet host. The PC first calls the provider's router via a 
modem. After the router's modem has answered the phone and established a physical connection, the PC sends 
the router a series of LCP packets in the payload field of one or more PPP frames. These packets and their 
responses select the PPP parameters to be used. 


Once the parameters have been agreed upon, a series of NCP packets are sent to configure the network layer. 
Typically, the PC wants to run a TCP/IP protocol stack, so it needs an IP address. There are not enough IP 
addresses to go around, so normally each Internet provider gets a block of them and then dynamically assigns 
one to each newly attached PC for the duration of its login session. If a provider owns n IP addresses, it can 
have up to n machines logged in simultaneously, but its total customer base may be many times that. The NCP 
for IP assigns the IP address. 


At this point, the PC is now an Internet host and can send and receive IP packets, just as hardwired hosts can. 
When the user is finished, NCP tears down the network layer connection and frees up the IP address. Then LCP 
shuts down the data link layer connection. Finally, the computer tells the modem to hang up the phone, releasing 
the physical layer connection. 


The PPP frame format was chosen to closely resemble the HDLC frame format, since there was no reason to 
reinvent the wheel. The major difference between PPP and HDLC is that PPP is character oriented rather than 
bit oriented. In particular, PPP uses byte stuffing on dial-up modem lines, so all frames are an integral number of 
bytes. It is not possible to send a frame consisting of 30.25 bytes, as it is with HDLC. Not only can PPP frames 
be sent over dial-up telephone lines, but they can also be sent over SONET or true bit-oriented HDLC lines (e.g., 
for router-router connections). The PPP frame format is shown in Fig. 3-27. 


Figure 3-27. The PPP full frame format for unnumbered mode operation. 

All PPP frames begin with the standard HDLC flag byte (01111110), which is byte stuffed if it occurs within the 
payload field. Next comes the Address field, which is always set to the binary value 11111111 to indicate that all 
stations are to accept the frame. Using this value avoids the issue of having to assign data link addresses. 


The Address field is followed by the Control field, the default value of which is 00000011. This value indicates an 
unnumbered frame. In other words, PPP does not provide reliable transmission using sequence numbers and 
acknowledgements as the default. In noisy environments, such as wireless networks, reliable transmission using 
numbered mode can be used. The exact details are defined in RFC 1663, but in practice it is rarely used. 


Since the Address and Control fields are always constant in the default configuration, LCP provides the 
necessary mechanism for the two parties to negotiate an option to just omit them altogether and save 2 bytes 
per frame. 


The fourth PPP field is the Protocol field. Its job is to tell what kind of packet is in the Payload field. Codes are 
defined for LCP, NCP, IP, IPX, AppleTalk, and other protocols. Protocols starting with a 0 bit are network layer 
protocols such as IP, IPX, OSI CLNP, XNS. Those starting with a 1 bit are used to negotiate other protocols. 
These include LCP and a different NCP for each network layer protocol supported. The default size of the 
Protocol field is 2 bytes, but it can be negotiated down to 1 byte using LCP. 


The Payload field is variable length, up to some negotiated maximum. If the length is not negotiated using LCP 
during line setup, a default length of 1500 bytes is used. Padding may follow the payload if need be. 


After the Payload field comes the Checksum field, which is normally 2 bytes, but a 4-byte checksum can be 
negotiated. 
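The fields just described can be assembled into a frame with a few lines of Python. The sketch below is illustrative rather than a complete implementation, and several details are assumptions that come from RFC 1662 rather than from the text above: the escape byte 0x7D, the rule that an escaped byte is XORed with 0x20, the particular 16-bit CRC, and the protocol code 0x0021 for IP. It uses the default 2-byte Protocol and Checksum fields, escapes only the flag and escape bytes, and omits padding. All function names are our own.

```python
import struct

FLAG = 0x7E     # 01111110, the HDLC flag byte
ESC = 0x7D      # escape byte (assumed, per RFC 1662)
ADDRESS = 0xFF  # 11111111: all stations accept the frame
CONTROL = 0x03  # 00000011: unnumbered frame
PROTO_IP = 0x0021  # assumed protocol code for IP


def fcs16(data: bytes) -> int:
    """16-bit CRC (reflected polynomial 0x8408), assumed per RFC 1662."""
    fcs = 0xFFFF
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            fcs = (fcs >> 1) ^ 0x8408 if fcs & 1 else fcs >> 1
    return fcs ^ 0xFFFF


def stuff(data: bytes) -> bytes:
    """Byte stuffing: escape any flag or escape byte inside the frame."""
    out = bytearray()
    for b in data:
        if b in (FLAG, ESC):
            out += bytes([ESC, b ^ 0x20])
        else:
            out.append(b)
    return bytes(out)


def unstuff(data: bytes) -> bytes:
    """Undo byte stuffing."""
    out = bytearray()
    it = iter(data)
    for b in it:
        out.append(next(it) ^ 0x20 if b == ESC else b)
    return bytes(out)


def build_frame(protocol: int, payload: bytes) -> bytes:
    """Flag | Address | Control | Protocol | Payload | Checksum | Flag."""
    body = bytes([ADDRESS, CONTROL]) + struct.pack(">H", protocol) + payload
    body += struct.pack("<H", fcs16(body))  # checksum, low-order byte first
    return bytes([FLAG]) + stuff(body) + bytes([FLAG])


def parse_frame(frame: bytes):
    """Return (protocol, payload) or raise ValueError on a damaged frame."""
    if len(frame) < 2 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("missing flag byte")
    body = unstuff(frame[1:-1])
    if len(body) < 6:
        raise ValueError("frame too short")
    (received,) = struct.unpack("<H", body[-2:])
    if fcs16(body[:-2]) != received:
        raise ValueError("bad checksum")
    if body[0] != ADDRESS or body[1] != CONTROL:
        raise ValueError("unexpected Address or Control field")
    (protocol,) = struct.unpack(">H", body[2:4])
    return protocol, body[4:-2]
```

A payload that happens to contain the flag byte, such as `b"\x7e\x7dhi"`, survives the round trip through `build_frame` and `parse_frame`, and no flag byte appears between the two delimiters of the resulting frame.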


In summary, PPP is a multiprotocol framing mechanism suitable for use over modems, HDLC bit-serial lines, 
SONET, and other physical layers. It supports error detection, option negotiation, header compression, and, 
optionally, reliable transmission using an HDLC-type frame format. 


Let us now turn from the PPP frame format to the way lines are brought up and down. The (simplified) diagram 
of Fig. 3-28 shows the phases that a line goes through when it is brought up, used, and taken down again. This 
sequence applies both to modem connections and to router-router connections. 


Figure 3-28. A simplified phase diagram for bringing a line up and down. 

The protocol starts with the line in the DEAD state, which means that no physical layer carrier is present and no 
physical layer connection exists. After physical connection is established, the line moves to ESTABLISH. At that 
point LCP option negotiation begins, which, if successful, leads to AUTHENTICATE. Now the two parties can 
check on each other's identities if desired. When the NETWORK phase is entered, the appropriate NCP protocol 
is invoked to configure the network layer. If the configuration is successful, OPEN is reached and data transport 
can take place. When data transport is finished, the line moves into the TERMINATE phase, and from there, 
back to DEAD when the carrier is dropped. 
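The sequence of phases just described is easy to express as a table-driven state machine. The following sketch covers only the successful path given in the text. The event names are hypothetical labels chosen for illustration, and the failure paths of the full diagram are deliberately left out.

```python
from enum import Enum, auto


class Phase(Enum):
    DEAD = auto()
    ESTABLISH = auto()
    AUTHENTICATE = auto()
    NETWORK = auto()
    OPEN = auto()
    TERMINATE = auto()


# (current phase, event) -> next phase. Event names are illustrative only.
TRANSITIONS = {
    (Phase.DEAD, "physical_connection_established"): Phase.ESTABLISH,
    (Phase.ESTABLISH, "lcp_options_agreed"): Phase.AUTHENTICATE,
    (Phase.AUTHENTICATE, "identities_checked"): Phase.NETWORK,
    (Phase.NETWORK, "ncp_configuration_ok"): Phase.OPEN,
    (Phase.OPEN, "data_transport_finished"): Phase.TERMINATE,
    (Phase.TERMINATE, "carrier_dropped"): Phase.DEAD,
}


def step(phase: Phase, event: str) -> Phase:
    """Return the phase reached when `event` occurs in `phase`."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not modeled in phase {phase.name}") from None


def run(events, start: Phase = Phase.DEAD) -> Phase:
    """Feed a sequence of events to the machine and return the final phase."""
    phase = start
    for event in events:
        phase = step(phase, event)
    return phase
```

Feeding the six events in order takes the line from DEAD through OPEN and back to DEAD; any event that is out of order raises an error instead of silently changing phase.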


LCP negotiates data link protocol options during the ESTABLISH phase. The LCP protocol is not actually 
concerned with the options themselves, but with the mechanism for negotiation. It provides a way for the 
initiating process to make a proposal and for the responding process to accept or reject it, in whole or in part. It 
also provides a way for the two processes to test the line quality to see if they consider it good enough to set up 
a connection. Finally, the LCP protocol also allows lines to be taken down when they are no longer needed. 


Eleven types of LCP frames are defined in RFC 1661. These are listed in Fig. 3-29. The four Configure- types 
allow the initiator (I) to propose option values and the responder (R) to accept or reject them. In the latter case, 
the responder can make an alternative proposal or announce that it is not willing to negotiate certain options at 
all. The options being negotiated and their proposed values are part of the LCP frames. 


Figure 3-29. The LCP frame types. 


Name               Direction   Description 
Configure-request  I → R       List of proposed options and values 
Configure-ack      I ← R       All options are accepted 
Configure-nak      I ← R       Some options are not accepted 
Configure-reject   I ← R       Some options are not negotiable 
Terminate-request  I → R       Request to shut the line down 
Terminate-ack      I ← R       OK, line shut down 
Code-reject        I ← R       Unknown request received 
Protocol-reject    I ← R       Unknown protocol requested 
Echo-request       I → R       Please send this frame back 
Echo-reply         I ← R       Here is the frame back 
Discard-request    I → R       Just discard this frame (for testing) 


The Terminate- codes shut a line down when it is no longer needed. The Code-reject and Protocol-reject codes 
indicate that the responder got something that it does not understand. This situation could mean that an 
undetected transmission error has occurred, but more likely it means that the initiator and responder are running 
different versions of the LCP protocol. The Echo- types are used to test the line quality. Finally, Discard-request 
helps with debugging. If either end is having trouble getting bits onto the wire, the programmer can use this type for 
testing. If it manages to get through, the receiver just throws it away, rather than taking some other action that 
might confuse the person doing the testing. 


The options that can be negotiated include setting the maximum payload size for data frames, enabling 
authentication and choosing a protocol to use, enabling line-quality monitoring during normal operation, and 
selecting various header compression options. 
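The give-and-take of the four Configure- types can be sketched as follows. This is a simplified model, not the procedure of RFC 1661: options are plain name/value pairs, the option names are invented for the example, and the responder's policy is a table of functions that map a proposed value to the value the responder is willing to use. Checking for non-negotiable options before sending a Configure-nak is an ordering assumed for this sketch.

```python
def respond(request: dict, policy: dict):
    """Responder side: answer a Configure-request.

    `policy` maps each negotiable option name to a function that returns
    the value the responder will accept (the same value if it is fine).
    """
    not_negotiable = {name: value for name, value in request.items()
                      if name not in policy}
    if not_negotiable:
        return "Configure-reject", not_negotiable
    alternatives = {name: policy[name](value) for name, value in request.items()
                    if policy[name](value) != value}
    if alternatives:
        return "Configure-nak", alternatives  # alternative proposal
    return "Configure-ack", dict(request)


def negotiate(proposal: dict, policy: dict, max_rounds: int = 5) -> dict:
    """Initiator side: revise the proposal until it is acknowledged."""
    proposal = dict(proposal)
    for _ in range(max_rounds):
        code, options = respond(proposal, policy)
        if code == "Configure-ack":
            return options
        if code == "Configure-reject":
            for name in options:      # stop asking for these options
                del proposal[name]
        else:                         # Configure-nak: adopt suggested values
            proposal.update(options)
    raise RuntimeError("negotiation did not converge")


# Hypothetical responder policy: payloads of at most 1500 bytes,
# header compression accepted as proposed.
policy = {
    "max_payload": lambda size: min(size, 1500),
    "header_compression": lambda enabled: enabled,
}
agreed = negotiate(
    {"max_payload": 4096, "header_compression": True, "mystery_option": 1},
    policy,
)
```

In this run the responder first rejects `mystery_option`, then answers the oversized payload with a Configure-nak suggesting 1500, and finally acknowledges the revised request.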


There is little to say about the NCP protocols in a general way. Each one is specific to some network layer 
protocol and allows configuration requests to be made that are specific to that protocol. For IP, for example, 
dynamic address assignment is the most important possibility. 


3.7 Summary 


The task of the data link layer is to convert the raw bit stream offered by the physical layer into a stream of 
frames for use by the network layer. Various framing methods are used, including character count, byte stuffing, 
and bit stuffing. Data link protocols can provide error control to retransmit damaged or lost frames. To prevent a 
fast sender from overrunning a slow receiver, the data link protocol can also provide flow control. The sliding 
window mechanism is widely used to integrate error control and flow control in a convenient way. 


Sliding window protocols can be categorized by the size of the sender's window and the size of the receiver's 
window. When both are equal to 1, the protocol is stop-and-wait. When the sender's window is greater than 1, 
for example, to prevent the sender from blocking on a circuit with a long propagation delay, the receiver can be 
programmed either to discard all frames other than the next one in sequence or to buffer out-of-order frames 
until they are needed. 


We examined a series of protocols in this chapter. Protocol 1 is designed for an error-free environment in which 
the receiver can handle any flow sent to it. Protocol 2 still assumes an error-free environment but introduces flow 
control. Protocol 3 handles errors by introducing sequence numbers and using the stop-and-wait algorithm. 
Protocol 4 allows bidirectional communication and introduces the concept of piggybacking. Protocol 5 uses a 
sliding window protocol with go back n. Finally, protocol 6 uses selective repeat and negative 
acknowledgements. 


Protocols can be modeled using various techniques to help demonstrate their correctness (or lack thereof). 
Finite state machine models and Petri net models are commonly used for this purpose. 


Many networks use one of the bit-oriented protocols—SDLC, HDLC, ADCCP, or LAPB—at the data link level. All 
of these protocols use flag bytes to delimit frames, and bit stuffing to prevent flag bytes from occurring in the 
data. All of them also use a sliding window for flow control. The Internet uses PPP as the primary data link 
protocol over point-to-point lines. 


Chapter 4. The Medium Access Control Sublayer 


4.1 Multiple Access Protocols 


Many algorithms for allocating a multiple access channel are known. In the following sections 
we will study a small sample of the more interesting ones and give some examples of their 
use. 


4.1.1 ALOHA 


In the 1970s, Norman Abramson and his colleagues at the University of Hawaii devised a new 
and elegant method to solve the channel allocation problem. Their work has been extended by 
many researchers since then (Abramson, 1985). Although Abramson's work, called the ALOHA 
system, used ground-based radio broadcasting, the basic idea is applicable to any system in 
which uncoordinated users are competing for the use of a single shared channel. 


We will discuss two versions of ALOHA here: pure and slotted. They differ with respect to 
whether time is divided into discrete slots into which all frames must fit. Pure ALOHA does not 
require global time synchronization; slotted ALOHA does. 


Pure ALOHA 


The basic idea of an ALOHA system is simple: let users transmit whenever they have data to 
be sent. There will be collisions, of course, and the colliding frames will be damaged. However, 
due to the feedback property of broadcasting, a sender can always find out whether its frame 
was destroyed by listening to the channel, the same way other users do. With a LAN, the 
feedback is immediate; with a satellite, there is a delay of 270 msec before the sender knows 
if the transmission was successful. If listening while transmitting is not possible for some 
reason, acknowledgements are needed. If the frame was destroyed, the sender just waits a 
random amount of time and sends it again. The waiting time must be random or the same 
frames will collide over and over, in lockstep. Systems in which multiple users share a common 
channel in a way that can lead to conflicts are widely known as contention systems. 


A sketch of frame generation in an ALOHA system is given in Fig. 4-1. We have made the 
frames all the same length because the throughput of ALOHA systems is maximized by having 
a uniform frame size rather than by allowing variable length frames. 


Figure 4-1. In pure ALOHA, frames are transmitted at completely 
arbitrary times. 

Whenever two frames try to occupy the channel at the same time, there will be a collision and 
both will be garbled. If the first bit of a new frame overlaps with just the last bit of a frame 
almost finished, both frames will be totally destroyed and both will have to be retransmitted 
later. The checksum cannot (and should not) distinguish between a total loss and a near miss. 
Bad is bad. 


An interesting question is: What is the efficiency of an ALOHA channel? In other words, what 
fraction of all transmitted frames escape collisions under these chaotic circumstances? Let us 
first consider an infinite collection of interactive users sitting at their computers (stations). A 
user is always in one of two states: typing or waiting. Initially, all users are in the typing state. 
When a line is finished, the user stops typing, waiting for a response. The station then 
transmits a frame containing the line and checks the channel to see if it was successful. If so, 
the user sees the reply and goes back to typing. If not, the user continues to wait and the 
frame is retransmitted over and over until it has been successfully sent. 


Let the "frame time" denote the amount of time needed to transmit the standard, fixed-length 
frame (i.e., the frame length divided by the bit rate). At this point we assume that the infinite 
population of users generates new frames according to a Poisson distribution with mean N 
frames per frame time. (The infinite-population assumption is needed to ensure that N does 
not decrease as users become blocked.) If N > 1, the user community is generating frames at 
a higher rate than the channel can handle, and nearly every frame will suffer a collision. For 
reasonable throughput we would expect 0 < N < 1. 


In addition to the new frames, the stations also generate retransmissions of frames that 
previously suffered collisions. Let us further assume that the probability of k transmission 
attempts per frame time, old and new combined, is also Poisson, with mean G per frame time. 


Clearly, $G \ge N$. At low load (i.e., $N \approx 0$), there will be few collisions, hence few 
retransmissions, so $G \approx N$. At high load there will be many collisions, so $G > N$. Under all 
loads, the throughput, S, is just the offered load, G, times the probability, $P_0$, of a transmission 
succeeding; that is, $S = GP_0$, where $P_0$ is the probability that a frame does not suffer a 
collision. 


A frame will not suffer a collision if no other frames are sent within one frame time of its start, 
as shown in Fig. 4-2. Under what conditions will the shaded frame arrive undamaged? Let t be 
the time required to send a frame. If any other user has generated a frame between time $t_0$ 
and $t_0 + t$, the end of that frame will collide with the beginning of the shaded one. In fact, the 
shaded frame's fate was already sealed even before the first bit was sent, but since in pure 
ALOHA a station does not listen to the channel before transmitting, it has no way of knowing 
that another frame was already underway. Similarly, any other frame started between $t_0 + t$ 
and $t_0 + 2t$ will bump into the end of the shaded frame. 


Figure 4-2. Vulnerable period for the shaded frame. 

The probability that k frames are generated during a given frame time is given by the Poisson 
distribution: 

Equation 4 


$$\Pr[k] = \frac{G^k e^{-G}}{k!}$$


so the probability of zero frames is just $e^{-G}$. In an interval two frame times long, the mean 
number of frames generated is 2G. The probability of no other traffic being initiated during the 
entire vulnerable period is thus given by $P_0 = e^{-2G}$. Using $S = GP_0$, we get 


$$S = Ge^{-2G}$$


The relation between the offered traffic and the throughput is shown in Fig. 4-3. The maximum 
throughput occurs at G = 0.5, with S = 1/(2e), which is about 0.184. In other words, the best 
we can hope for is a channel utilization of 18 percent. This result is not very encouraging, but 
with everyone transmitting at will, we could hardly have expected a 100 percent success rate. 
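The 18 percent figure can be checked numerically. The sketch below evaluates the pure ALOHA throughput, G times e to the power -2G, and also runs a small Monte Carlo experiment under the same assumptions as the analysis: frame starts form a Poisson process with mean G per frame time, every frame lasts exactly one frame time, and a frame survives only if no other frame starts within one frame time on either side of its own start. The function names, the simulated duration, and the random seed are arbitrary choices made for illustration.

```python
import math
import random


def pure_aloha_throughput(G: float) -> float:
    """Analytical throughput of pure ALOHA: S = G * exp(-2G)."""
    return G * math.exp(-2 * G)


def simulate_pure_aloha(G: float, frame_times: int = 50_000, seed: int = 1) -> float:
    """Estimate S by simulation; time is measured in units of one frame time."""
    rng = random.Random(seed)
    # A Poisson process has exponentially distributed gaps with mean 1/G.
    starts = []
    t = rng.expovariate(G)
    while t < frame_times:
        starts.append(t)
        t += rng.expovariate(G)
    successes = 0
    for i, s in enumerate(starts):
        # The vulnerable period extends one frame time before and after s.
        clear_before = i == 0 or s - starts[i - 1] >= 1.0
        clear_after = i == len(starts) - 1 or starts[i + 1] - s >= 1.0
        if clear_before and clear_after:
            successes += 1
    return successes / frame_times  # successful frames per frame time


if __name__ == "__main__":
    for G in (0.25, 0.5, 1.0, 2.0):
        print(G, round(pure_aloha_throughput(G), 3), round(simulate_pure_aloha(G), 3))
```

Setting the derivative of S with respect to G to zero gives 1 - 2G = 0, which is why the peak lies at G = 0.5.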


Slotted ALOHA 


In 1972, Roberts published a method for doubling the capacity of an ALOHA system (Roberts, 
1972). His proposal was to divide time into discrete intervals, each interval corresponding to 
one frame. This approach requires the users to agree on slot boundaries. One way to achieve 
synchronization would be to have one special station emit a pip at the start of each interval, 
like a clock. 


In Roberts' method, which has come to be known as slotted ALOHA, in contrast to 
Abramson's pure ALOHA, a computer is not permitted to send whenever a carriage return is 
typed. Instead, it is required to wait for the beginning of the next slot. Thus, the continuous 
pure ALOHA is turned into a discrete one. Since the vulnerable period is now halved, the 
probability of no other traffic during the same slot as our test frame is $e^{-G}$, which leads to 


Equation 4 


$$S = Ge^{-G}$$


As you can see from Fig. 4-3, slotted ALOHA peaks at G = 1, with a throughput of S =1/e or 
about 0.368, twice that of pure ALOHA. If the system is operating at G = 1, the probability of 
an empty slot is 0.368 (from Eq. 4-2). The best we can hope for using slotted ALOHA is 37 
percent of the slots empty, 37 percent successes, and 26 percent collisions. Operating at 
higher values of G reduces the number of empties but increases the number of collisions 
exponentially. To see how this rapid growth of collisions with G comes about, consider the 
transmission of a test frame. The probability that it will avoid a collision is $e^{-G}$, the probability 
that all the other users are silent in that slot. The probability of a collision is then just $1 - e^{-G}$. 
The probability of a transmission requiring exactly k attempts (i.e., k - 1 collisions followed by 
one success) is 


$$P_k = e^{-G}\left(1 - e^{-G}\right)^{k-1}$$


Figure 4-3. Throughput versus offered traffic for ALOHA systems. 


The expected number of transmissions, E, per carriage return typed is then 


$$E = \sum_{k=1}^{\infty} k P_k = \sum_{k=1}^{\infty} k e^{-G}\left(1 - e^{-G}\right)^{k-1} = e^{G}$$


As a result of the exponential dependence of E upon G, small increases in the channel load can 
drastically reduce its performance. 
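The slotted ALOHA figures quoted above are easy to reproduce. The sketch below assumes, as in the analysis, that the number of transmission attempts per slot is Poisson with mean G, so a slot is empty with probability e^-G and carries exactly one frame with probability G e^-G. It also sums the series for the expected number of attempts, in which attempt k succeeds with probability e^-G (1 - e^-G)^(k-1), and compares the result with the closed form e^G. The function names are our own.

```python
import math


def slot_outcomes(G: float):
    """Return the probabilities (empty, success, collision) for one slot."""
    empty = math.exp(-G)        # nobody transmits
    success = G * math.exp(-G)  # exactly one station transmits
    collision = 1.0 - empty - success
    return empty, success, collision


def prob_exactly_k_attempts(G: float, k: int) -> float:
    """k - 1 collisions followed by one success."""
    p = math.exp(-G)
    return p * (1.0 - p) ** (k - 1)


def expected_attempts(G: float, terms: int = 10_000) -> float:
    """Sum the series for E numerically; it should approach exp(G)."""
    return sum(k * prob_exactly_k_attempts(G, k) for k in range(1, terms + 1))


if __name__ == "__main__":
    empty, success, collision = slot_outcomes(1.0)
    print(f"G = 1: empty {empty:.3f}, success {success:.3f}, collision {collision:.3f}")
    for G in (0.5, 1.0, 2.0, 3.0):
        print(G, round(expected_attempts(G), 3), round(math.exp(G), 3))
```

Going from G = 1 to G = 3 raises the expected number of transmissions per frame from about 2.7 to about 20, which illustrates how quickly performance degrades with load.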


Slotted ALOHA is important for a reason that may not be initially obvious. It was devised in the 
1970s, used in a few early experimental systems, then almost forgotten. When Internet access 
over the cable was invented, all of a sudden there was a problem of how to allocate a shared 
channel among multiple competing users, and slotted ALOHA was pulled out of the garbage can 
to save the day. It has often happened that protocols that are perfectly valid fall into disuse for 
political reasons (e.g., some big company wants everyone to do things its way), but years later 
some clever person realizes that a long-discarded protocol solves his current problem. For this 
reason, in this chapter we will study a number of elegant protocols that are not currently in 
widespread use, but might easily be used in future applications, provided that enough network 
designers are aware of them. Of course, we will also study many protocols that are in current 
use as well. 


4.1.2 Carrier Sense Multiple Access Protocols 


With slotted ALOHA the best channel utilization that can be achieved is 1/e. This is hardly 
surprising, since with stations transmitting at will, without paying attention to what the other 
stations are doing, there are bound to be many collisions. In local area networks, however, it 
is possible for stations to detect what other stations are doing, and adapt their behavior 
accordingly. These networks can achieve a much better utilization than 1/e. In this section we 
will discuss some protocols for improving performance. 


Protocols in which stations listen for a carrier (i.e., a transmission) and act accordingly are 
called carrier sense protocols. A number of them have been proposed. Kleinrock and Tobagi 
(1975) have analyzed several such protocols in detail. Below we will mention several versions 
of the carrier sense protocols. 


Persistent and Nonpersistent CSMA 


The first carrier sense protocol that we will study here is called 1-persistent CSMA (Carrier 
Sense Multiple Access). When a station has data to send, it first listens to the channel to see if 
anyone else is transmitting at that moment. If the channel is busy, the station waits until it 
becomes idle. When the station detects an idle channel, it transmits a frame. If a collision 
occurs, the station waits a random amount of time and starts all over again. The protocol is 
called 1-persistent because the station transmits with a probability of 1 when it finds the 
channel idle. 


The propagation delay has an important effect on the performance of the protocol. There is a 
small chance that just after a station begins sending, another station will become ready to 
send and sense the channel. If the first station's signal has not yet reached the second one, 
the latter will sense an idle channel and will also begin sending, resulting in a collision. The 
longer the propagation delay, the more important this effect becomes, and the worse the 
performance of the protocol. 


Even if the propagation delay is zero, there will still be collisions. If two stations become ready 
in the middle of a third station's transmission, both will wait politely until the transmission 
ends and then both will begin transmitting exactly simultaneously, resulting in a collision. If 
they were not so impatient, there would be fewer collisions. Even so, this protocol is far better 
than pure ALOHA because both stations have the decency to desist from interfering with the 
third station's frame. Intuitively, this approach will lead to a higher performance than pure 
ALOHA. Exactly the same holds for slotted ALOHA. 


A second carrier sense protocol is nonpersistent CSMA. In this protocol, a conscious attempt 
is made to be less greedy than in the previous one. Before sending, a station senses the 
channel. If no one else is sending, the station begins doing so itself. However, if the channel is 
already in use, the station does not continually sense it for the purpose of seizing it 
immediately upon detecting the end of the previous transmission. Instead, it waits a random 
period of time and then repeats the algorithm. Consequently, this algorithm leads to better 
channel utilization but longer delays than 1-persistent CSMA. 


The last protocol is p-persistent CSMA. It applies to slotted channels and works as follows. 
When a station becomes ready to send, it senses the channel. If it is idle, it transmits with a 
probability p. With a probability q = 1 - p, it defers until the next slot. If that slot is also idle, it 
either transmits or defers again, with probabilities p and q. This process is repeated until either 
the frame has been transmitted or another station has begun transmitting. In the latter case, 
the unlucky station acts as if there had been a collision (i.e., it waits a random time and starts 
again). If the station initially senses the channel busy, it waits until the next slot and applies 
the above algorithm. Figure 4-4 shows the computed throughput versus offered traffic for all 
three protocols, as well as for pure and slotted ALOHA. 
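The effect of p on collisions can be made concrete with a small calculation. The sketch below is an illustration, not part of the analysis behind Fig. 4-4. It assumes that k stations are all waiting when the channel goes idle and that each one independently transmits in the first idle slot with probability p. The function name first_slot_outcomes is our own.

```python
def first_slot_outcomes(k: int, p: float) -> dict:
    """Outcome probabilities for the first idle slot.

    Assumption (ours): k stations are waiting, and each transmits
    independently with probability p, as in p-persistent CSMA.
    """
    idle = (1 - p) ** k                      # nobody transmits
    success = k * p * (1 - p) ** (k - 1)     # exactly one station transmits
    collision = 1 - idle - success           # two or more transmit
    return {"idle": idle, "success": success, "collision": collision}


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.1):
        print(p, first_slot_outcomes(2, p))
```

With two waiting stations, p = 1 (the 1-persistent case) makes a collision certain, whereas p = 0.1 reduces the collision probability to 0.01 at the cost of leaving the slot idle 81 percent of the time.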


Figure 4-4. Comparison of the channel utilization versus load for 
various random access protocols. 
[Plot not reproduced. Vertical axis: S (throughput per packet time). Labeled curves include 
0.01-persistent CSMA, 0.1-persistent CSMA, nonpersistent CSMA, and slotted ALOHA.] 


CSMA with Collision Detection 


Persistent and nonpersistent CSMA protocols are clearly an improvement over ALOHA because 
they ensure that no station begins to transmit when it senses the channel busy. Another 
improvement is for stations to abort their transmissions as soon as they detect a collision. In 
other words, if two stations sense the channel to be idle and begin transmitting 
simultaneously, they will both detect the collision almost immediately. Rather than finish 
transmitting their frames, which are irretrievably garbled anyway, they should abruptly stop 
transmitting as soon as the collision is detected. Quickly terminating damaged frames saves 
time and bandwidth. This protocol, known as CSMA/CD (CSMA with Collision Detection), is 
widely used on LANs in the MAC sublayer. In particular, it is the basis of the popular Ethernet 
LAN, so it is worth devoting some time to looking at it in detail. 


CSMA/CD, as well as many other LAN protocols, uses the conceptual model of Fig. 4-5. At the 
point marked t0, a station has finished transmitting its frame. Any other station having a frame 
to send may now attempt to do so. If two or more stations decide to transmit simultaneously, 
there will be a collision. Collisions can be detected by looking at the power or pulse width of 
the received signal and comparing it to the transmitted signal. 


Figure 4-5. CSMA/CD can be in one of three states: contention, 
transmission, or idle. 


After a station detects a collision, it aborts its transmission, waits a random period of time, and 
then tries again, assuming that no other station has started transmitting in the meantime. 
Therefore, our model for CSMA/CD will consist of alternating contention and transmission 
periods, with idle periods occurring when all stations are quiet (e.g., for lack of work). 


Now let us look closely at the details of the contention algorithm. Suppose that two stations 
both begin transmitting at exactly time t0. How long will it take them to realize that there has 
been a collision? The answer to this question is vital to determining the length of the 
contention period and hence what the delay and throughput will be. The minimum time to 
detect the collision is then just the time it takes the signal to propagate from one station to the 
other. 


Based on this reasoning, you might think that a station not hearing a collision for a time equal 
to the full cable propagation time after starting its transmission could be sure it had seized the 
cable. By "seized," we mean that all other stations knew it was transmitting and would not 
interfere. This conclusion is wrong. Consider the following worst-case scenario. Let the time for 
a signal to propagate between the two farthest stations be τ. At t0, one station begins 
transmitting. At τ - ε, an instant before the signal arrives at the most distant station, that 
station also begins transmitting. Of course, it detects the collision almost instantly and stops, 
but the little noise burst caused by the collision does not get back to the original station until 
time 2τ - ε. In other words, in the worst case a station cannot be sure that it has seized the 
channel until it has transmitted for 2τ without hearing a collision. For this reason we will model 
the contention interval as a slotted ALOHA system with slot width 2τ. On a 1-km long coaxial 
cable, τ ≈ 5 µsec. For simplicity we will assume that each slot contains just 1 bit. Once the 
channel has been seized, a station can transmit at any rate it wants to, of course, not just at 1 
bit per 2τ sec. 
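As a rough check on the numbers, the contention slot width can be computed from the cable length. This is only a sketch: the signal speed of 2 x 10^8 m/sec is our assumption, chosen because it gives a one-way delay of 5 µsec over 1 km, and the function name is hypothetical.

```python
def contention_slot_seconds(cable_length_m: float,
                            signal_speed_m_per_s: float = 2e8) -> float:
    """Return the contention slot width, i.e. twice the one-way propagation delay.

    Assumption (ours): signals travel at about 2e8 m/s in the cable.
    """
    tau = cable_length_m / signal_speed_m_per_s   # one-way propagation delay
    return 2 * tau                                # worst-case collision detection time


if __name__ == "__main__":
    print(contention_slot_seconds(1000))   # slot width in seconds for a 1-km cable
```

For a 1-km cable this gives a slot of 10 µsec, so a station must transmit for that long without hearing a collision before it can be sure it has seized the channel.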


It is important to realize that collision detection is an analog process. The station's hardware 
must listen to the cable while it is transmitting. If what it reads back is different from what it is 
putting out, it knows that a collision is occurring. The implication is that the signal encoding 
must allow collisions to be detected (e.g., a collision of two 0-volt signals may well be 
impossible to detect). For this reason, special encoding is commonly used. 


It is also worth noting that a sending station must continually monitor the channel, listening 
for noise bursts that might indicate a collision. For this reason, CSMA/CD with a single channel 
is inherently a half-duplex system. It is impossible for a station to transmit and receive frames 
at the same time because the receiving logic is in use, looking for collisions during every 
transmission. 


To avoid any misunderstanding, it is worth noting that no MAC-sublayer protocol guarantees 
reliable delivery. Even in the absence of collisions, the receiver may not have copied the frame 
correctly for various reasons (e.g., lack of buffer space or a missed interrupt). 


4.1.3 Collision-Free Protocols 


Although collisions do not occur with CSMA/CD once a station has unambiguously captured the 
channel, they can still occur during the contention period. These collisions adversely affect the 
system performance, especially when the cable is long (i.e., large τ) and the frames are short. 
And CSMA/CD is not universally applicable. In this section, we will examine some protocols 
that resolve the contention for the channel without any collisions at all, not even during the 
contention period. Most of these are not currently used in major systems, but in a rapidly 
changing field, having some protocols with excellent properties available for future systems is 
often a good thing. 


In the protocols to be described, we assume that there are exactly N stations, each with a 
unique address from 0 to N - 1 "wired" into it. It does not matter that some stations may be 
inactive part of the time. We also assume that propagation delay is negligible. The basic 
question remains: Which station gets the channel after a successful transmission? We continue 
using the model of Fig. 4-5 with its discrete contention slots. 


A Bit-Map Protocol 


In our first collision-free protocol, the basic bit-map method, each contention period consists 
of exactly N slots. If station 0 has a frame to send, it transmits a 1 bit during the zeroth slot. 
No other station is allowed to transmit during this slot. Regardless of what station 0 does, 
station 1 gets the opportunity to transmit a 1 during slot 1, but only if it has a frame queued. 
In general, station j may announce that it has a frame to send by inserting a 1 bit into slot j. 
After all N slots have passed by, each station has complete knowledge of which stations wish 
to transmit. At that point, they begin transmitting in numerical order (see Fig. 4-6). 


Figure 4-6. The basic bit-map protocol. 


Since everyone agrees on who goes next, there will never be any collisions. After the last 
ready station has transmitted its frame, an event all stations can easily monitor, another N-bit 
contention period is begun. If a station becomes ready just after its bit slot has passed by, it is 
out of luck and must remain silent until every station has had a chance and the bit map has 
come around again. Protocols like this in which the desire to transmit is broadcast before the 
actual transmission are called reservation protocols. 
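One contention period of the basic bit-map method can be sketched in a few lines. The code below is an illustration under the assumptions stated in the text (N stations numbered 0 to N - 1, negligible propagation delay). The function name bitmap_round and its arguments are our own.

```python
def bitmap_round(ready: set[int], n_stations: int) -> tuple[list[int], list[int]]:
    """Simulate one N-slot contention period of the basic bit-map protocol.

    ready is the set of station numbers that have a frame queued.
    Returns (bit map seen by every station, order in which stations transmit).
    """
    # Station j inserts a 1 bit into slot j only if it has a frame to send.
    bitmap = [1 if station in ready else 0 for station in range(n_stations)]
    # After all N slots, the ready stations transmit in numerical order.
    order = [station for station, bit in enumerate(bitmap) if bit == 1]
    return bitmap, order


if __name__ == "__main__":
    print(bitmap_round({1, 3, 7}, 8))
```

Because every station sees the same bit map, every station derives the same transmission order, which is why no collisions can occur.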


Let us briefly analyze the performance of this protocol. For convenience, we will measure time 
in units of the contention bit slot, with data frames consisting of d time units. Under conditions 
of low load, the bit map will simply be repeated over and over, for lack of data frames. 


Consider the situation from the point of view of a low-numbered station, such as 0 or 1. 
Typically, when it becomes ready to send, the "current" slot will be somewhere in the middle 
of the bit map. On average, the station will have to wait N/2 slots for the current scan to finish 
and another full N slots for the following scan to run to completion before it may begin 
transmitting. 


The prospects for high-numbered stations are brighter. Generally, these will only have to wait 
half a scan (N/2 bit slots) before starting to transmit. High-numbered stations rarely have to 
wait for the next scan. Since low-numbered stations must wait on average 1.5N slots and 
high-numbered stations must wait on average 0.5N slots, the mean for all stations is N slots. The 
channel efficiency at low load is easy to compute. The overhead per frame is N bits, and the 
amount of data is d bits, for an efficiency of d/(N + d). 


At high load, when all the stations have something to send all the time, the N bit contention 
period is prorated over N frames, yielding an overhead of only 1 bit per frame, or an efficiency 
of d/(d + 1). The mean delay for a frame is equal to the sum of the time it queues inside its 
station, plus an additional N(d + 1)/2 once it gets to the head of its internal queue. 
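The two efficiency formulas are easy to evaluate for concrete values. In the sketch below, d and n stand for the frame length in bits and the number of stations, as in the text. The function names are ours.

```python
def bitmap_efficiency_low_load(d: int, n: int) -> float:
    """Efficiency at low load: each frame carries the full N-bit overhead."""
    return d / (n + d)


def bitmap_efficiency_high_load(d: int) -> float:
    """Efficiency at high load: the N-bit map is shared by N frames, 1 bit each."""
    return d / (d + 1)


if __name__ == "__main__":
    d, n = 1000, 100
    print(bitmap_efficiency_low_load(d, n))
    print(bitmap_efficiency_high_load(d))
```

Note that the high-load efficiency does not depend on N at all, whereas the low-load efficiency falls as stations are added.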


Binary Countdown 


A problem with the basic bit-map protocol is that the overhead is 1 bit per station, so it does 
not scale well to networks with thousands of stations. We can do better than that by using 
binary station addresses. A station wanting to use the channel now broadcasts its address as a 
binary bit string, starting with the high-order bit. All addresses are assumed to be the same 
length. The bits in each address position from different stations are BOOLEAN ORed together. 
We will call this protocol binary countdown. It was used in Datakit (Fraser, 1987). It 
implicitly assumes that the transmission delays are negligible so that all stations see asserted 
bits essentially instantaneously. 


To avoid conflicts, an arbitration rule must be applied: as soon as a station sees that a 
high-order bit position that is 0 in its address has been overwritten with a 1, it gives up. For 
example, if stations 0010, 0100, 1001, and 1010 are all trying to get the channel, in the first 
bit time the stations transmit 0, 0, 1, and 1, respectively. These are ORed together to form a 
1. Stations 0010 and 0100 see the 1 and know that a higher-numbered station is competing 
for the channel, so they give up for the current round. Stations 1001 and 1010 continue. 


The next bit is 0, and both stations continue. The next bit is 1, so station 1001 gives up. The 
winner is station 1010 because it has the highest address. After winning the bidding, it may 
now transmit a frame, after which another bidding cycle starts. The protocol is illustrated in 
Fig. 4-7. It has the property that higher-numbered stations have a higher priority than 
lower-numbered stations, which may be either good or bad, depending on the context. 


Figure 4-7. The binary countdown protocol. A dash indicates silence. 


The channel efficiency of this method is d/(d + log2 N). If, however, the frame format has been 
cleverly chosen so that the sender's address is the first field in the frame, even these log2 N 
bits are not wasted, and the efficiency is 100 percent. 
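The arbitration rule of binary countdown can be sketched as follows. This is a minimal illustration, assuming at least one contender, distinct addresses of equal width, and negligible delay so that every station sees the ORed bit at once. The function name binary_countdown is ours.

```python
def binary_countdown(addresses: list[int], width: int) -> int:
    """Return the address that wins one bidding cycle.

    addresses: distinct addresses of the contending stations (at least one).
    width: number of bits in every address.
    """
    contenders = set(addresses)
    for bit in range(width - 1, -1, -1):        # high-order bit first
        # The channel ORs together the bits sent by the remaining contenders.
        channel = 0
        for address in contenders:
            channel |= (address >> bit) & 1
        if channel == 1:
            # A station that sent 0 but sees 1 on the channel gives up.
            contenders = {a for a in contenders if (a >> bit) & 1}
    (winner,) = contenders                      # exactly one station remains
    return winner


if __name__ == "__main__":
    stations = [0b0010, 0b0100, 0b1001, 0b1010]
    print(format(binary_countdown(stations, 4), "04b"))
```

Run on the four stations of the example, the sketch selects station 1010, in agreement with the walk-through above.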


Mok and Ward (1979) have described a variation of binary countdown using a parallel rather 
than a serial interface. They also suggest using virtual station numbers, with the virtual station 
numbers from 0 up to and including the successful station being circularly permuted after each 
transmission, in order to give higher priority to stations that have been silent unusually long. 
For example, if stations C, H, D, A, G, B, E, F have priorities 7, 6, 5, 4, 3, 2, 1, and 0, 
respectively, then a successful transmission by D puts it at the end of the list, giving a priority 
order of C, H, A, G, B, E, F, D. Thus, C remains virtual station 7, but A moves up from 4 to 5 
and D drops from 5 to 0. Station D will now only be able to acquire the channel if no other 
station wants it. 
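The circular permutation of virtual station numbers can be sketched as a simple list operation. The code is an illustration of the example above, not Mok and Ward's implementation. The function names are our own.

```python
def rotate_after_success(order: list[str], winner: str) -> list[str]:
    """Move the station that just transmitted to the lowest priority.

    order lists the stations from highest priority to lowest.
    """
    return [station for station in order if station != winner] + [winner]


def virtual_numbers(order: list[str]) -> dict[str, int]:
    """Map each station to its virtual station number (highest priority first)."""
    top = len(order) - 1
    return {station: top - position for position, station in enumerate(order)}


if __name__ == "__main__":
    order = list("CHDAGBEF")
    new_order = rotate_after_success(order, "D")
    print(new_order)
    print(virtual_numbers(new_order))
```

Only the stations ranked below the winner change their numbers. Each moves up by one, and stations ranked above the winner are unaffected.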


Binary countdown is an example of a simple, elegant, and efficient protocol that is waiting to 
be rediscovered. Hopefully, it will find a new home some day. 


4.1.4 Limited-Contention Protocols 


We have now considered two basic strategies for channel acquisition in a cable network: 
contention, as in CSMA, and collision-free methods. Each strategy can be rated as to how well 
it does with respect to the two important performance measures, delay at low load and 
channel efficiency at high load. Under conditions of light load, contention (i.e., pure or slotted 
ALOHA) is preferable due to its low delay. As the load increases, contention becomes 
increasingly less attractive, because the overhead associated with channel arbitration becomes 
greater. Just the reverse is true for the collision-free protocols. At low load, they have high 
delay, but as the load increases, the channel efficiency improves rather than gets worse as it 
does for contention protocols. 


Obviously, it would be nice if we could combine the best properties of the contention and 
collision-free protocols, arriving at a new protocol that used contention at low load to provide 
low delay, but used a collision-free technique at high load to provide good channel efficiency. 
Such protocols, which we will call limited-contention protocols, do, in fact, exist, and will 
conclude our study of carrier sense networks. 


Up to now the only contention protocols we have studied have been symmetric, that is, each 
station attempts to acquire the channel with some probability, p, with all stations using the 
same p. Interestingly enough, the overall system performance can sometimes be improved by 
using a protocol that assigns different probabilities to different stations. 


Before looking at the asymmetric protocols, let us quickly review the performance of the 
symmetric case. Suppose that k stations are contending for channel access. Each has a 
probability p of transmitting during each slot. The probability that some station successfully 
acquires the channel during a given slot is then kp(1 - p)^{k-1}. To find the optimal value of p, we 
differentiate with respect to p, set the result to zero, and solve for p. Doing so, we find that 
the best value of p is 1/k. Substituting p = 1/k, we get 


Equation 4 


\Pr[\text{success with optimal } p] = \left( \frac{k-1}{k} \right)^{k-1} 


This probability is plotted in Fig. 4-8. For small numbers of stations, the chances of success are 
good, but as soon as the number of stations reaches even five, the probability has dropped 
close to its asymptotic value of 1/e. 
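The drop toward 1/e can be checked numerically. The sketch below evaluates the success probability for k contending stations, first for an arbitrary p and then for the optimal p = 1/k. The function names are ours.

```python
import math


def success_probability(k: int, p: float) -> float:
    """Probability that exactly one of k stations transmits in a slot."""
    return k * p * (1 - p) ** (k - 1)


def optimal_success_probability(k: int) -> float:
    """Success probability when every station uses the optimal p = 1/k."""
    return success_probability(k, 1 / k)


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 25):
        print(k, round(optimal_success_probability(k), 4))
    print("1/e =", round(1 / math.e, 4))
```

For k = 2 the value is 0.5, and for k = 5 it is already 0.4096, not far above 1/e (about 0.368).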


Figure 4-8. Acquisition probability for a symmetric contention channel. 


From Fig. 4-8, it is fairly obvious that the probability of some station acquiring the channel can 
be increased only by decreasing the amount of competition. The limited-contention protocols 
do precisely that. They first divide the stations into (not necessarily disjoint) groups. Only the 
members of group 0 are permitted to compete for slot 0. If one of them succeeds, it acquires 
the channel and transmits its frame. If the slot lies fallow or if there is a collision, the members 
of group 1 contend for slot 1, etc. By making an appropriate division of stations into groups, 
the amount of contention for each slot can be reduced, thus operating each slot near the left 
end of Fig. 4-8. 


The trick is how to assign stations to slots. Before looking at the general case, let us consider 
some special cases. At one extreme, each group has but one member. Such an assignment 
guarantees that there will never be collisions because at most one station is contending for any 
given slot. We have seen such protocols before (e.g., binary countdown). The next special case 
is to assign two stations per group. The probability that both will try to transmit during a slot is 
p^2, which for small p is negligible. As more and more stations are assigned to the same slot, 
the probability of a collision grows, but the length of the bit-map scan needed to give everyone 
a chance shrinks. The limiting case is a single group containing all stations (i.e., slotted 
ALOHA). What we need is a way to assign stations to slots dynamically, with many stations per 
slot when the load is low and few (or even just one) station per slot when the load is high. 


The Adaptive Tree Walk Protocol 


One particularly simple way of performing the necessary assignment is to use the algorithm 
devised by the U.S. Army for testing soldiers for syphilis during World War II (Dorfman, 1943). 
In short, the Army took a blood sample from N soldiers. A portion of each sample was poured 
into a single test tube. This mixed sample was then tested for antibodies. If none were found, 
all the soldiers in the group were declared healthy. If antibodies were present, two new mixed 
samples were prepared, one from soldiers 1 through N/2 and one from the rest. The process 
was repeated recursively until the infected soldiers were determined. 


For the computerized version of this algorithm (Capetanakis, 1979), it is convenient to think of 
the stations as the leaves of a binary tree, as illustrated in Fig. 4-9. In the first contention slot 
following a successful frame transmission, slot 0, all stations are permitted to try to acquire 
the channel. If one of them does so, fine. If there is a collision, then during slot 1 only those 
stations falling under node 2 in the tree may compete. If one of them acquires the channel, 
the slot following the frame is reserved for those stations under node 3. If, on the other hand, 
two or more stations under node 2 want to transmit, there will be a collision during slot 1, in 
which case it is node 4's turn during slot 2. 


Figure 4-9. The tree for eight stations. 


In essence, if a collision occurs during slot 0, the entire tree is searched, depth first, to locate 
all ready stations. Each bit slot is associated with some particular node in the tree. If a collision 
occurs, the search continues recursively with the node's left and right children. If a bit slot is 
idle or if only one station transmits in it, the searching of its node can stop because all ready 
stations have been located. (Were there more than one, there would have been a collision.) 
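The depth-first search can be sketched as a short recursive routine. This is an illustration of the basic algorithm only, under assumptions of our own: the number of stations is a power of two, the tree nodes are numbered so that node n has children 2n and 2n + 1, and stations A through H are the leaves 8 through 15. The function name tree_walk is hypothetical.

```python
def tree_walk(ready: set[str], stations: str = "ABCDEFGH") -> list[tuple[int, str]]:
    """Return the sequence of (node, outcome) probes for one tree walk.

    ready: names of the stations that have a frame to send.
    Assumption (ours): node n has children 2n and 2n + 1, leaves are n..2n-1.
    """
    n = len(stations)                               # assumed to be a power of two
    leaf_of = {name: n + i for i, name in enumerate(stations)}
    ready_leaves = {leaf_of[name] for name in ready}

    def leaves_under(node: int) -> range:
        lo, hi = node, node
        while lo < n:                               # descend to the leaf level
            lo, hi = 2 * lo, 2 * hi + 1
        return range(lo, hi + 1)

    log: list[tuple[int, str]] = []

    def probe(node: int) -> None:
        count = sum(1 for leaf in leaves_under(node) if leaf in ready_leaves)
        if count == 0:
            log.append((node, "idle"))              # nothing below this node
        elif count == 1:
            log.append((node, "success"))           # the lone station transmits
        else:
            log.append((node, "collision"))         # search both children
            probe(2 * node)
            probe(2 * node + 1)

    probe(1)
    return log


if __name__ == "__main__":
    for node, outcome in tree_walk({"G", "H"}):
        print(node, outcome)
```

With only G and H ready, the basic algorithm probes nodes 1, 2, 3, 6, and 7 before reaching the two leaves, so five slots pass before the first frame is sent.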


When the load on the system is heavy, it is hardly worth the effort to dedicate slot 0 to node 
1, because that makes sense only in the unlikely event that precisely one station has a frame 
to send. Similarly, one could argue that nodes 2 and 3 should be skipped as well for the same 
reason. Put in more general terms, at what level in the tree should the search begin? Clearly, 
the heavier the load, the farther down the tree the search should begin. We will assume that 
each station has a good estimate of the number of ready stations, q, for example, from 
monitoring recent traffic. 


To proceed, let us number the levels of the tree from the top, with node 1 in Fig. 4-9 at level 
0, nodes 2 and 3 at level 1, etc. Notice that each node at level i has a fraction 2^{-i} of the 
stations below it. If the q ready stations are uniformly distributed, the expected number of 
them below a specific node at level i is just 2^{-i}q. Intuitively, we would expect the optimal level 
to begin searching the tree as the one at which the mean number of contending stations per 
slot is 1, that is, the level at which 2^{-i}q = 1. Solving this equation, we find that i = log2 q. 
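The choice of starting level can be sketched numerically. Since log2 q is generally not an integer, the code below rounds it to the nearest level and limits it to the depth of the tree. Both the rounding and the limit are our own assumptions, as are the function names.

```python
import math


def expected_ready_below(level: int, q: float) -> float:
    """Expected number of ready stations below one node at the given level."""
    return q * 2.0 ** (-level)


def starting_level(q: float, n_stations: int) -> int:
    """Level at which to begin the search, for an estimate q of ready stations.

    Assumptions (ours): n_stations is a power of two, log2(q) is rounded to
    the nearest integer, and the result never exceeds the leaf level.
    """
    if q <= 1:
        return 0                                  # light load: start at the root
    depth = int(math.log2(n_stations))            # level of the leaves
    return min(depth, round(math.log2(q)))


if __name__ == "__main__":
    for q in (1, 2, 4, 8):
        level = starting_level(q, 8)
        print(q, level, expected_ready_below(level, q))
```

At the chosen level the expected number of ready stations per node is 1 whenever q is a power of two, which is the condition derived above.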


Numerous improvements to the basic algorithm have been discovered and are discussed in 
some detail by Bertsekas and Gallager (1992). For example, consider the case of stations G 
and H being the only ones wanting to transmit. At node 1 a collision will occur, so 2 will be 
tried and discovered idle. It is pointless to probe node 3 since it is guaranteed to have a 
collision (we know that two or more stations under 1 are ready and none of them are under 2, 
so they must all be under 3). The probe of 3 can be skipped and 6 tried next. When this probe 
also turns up nothing, 7 can be skipped and node G tried next. 


4.1.5 Wavelength Division Multiple Access Protocols 


A different approach to channel allocation is to divide the channel into subchannels using FDM, 
TDM, or both, and dynamically allocate them as needed. Schemes like this are commonly used 
on fiber optic LANs to permit different conversations to use different wavelengths (i.e., 
frequencies) at the same time. In this section we will examine one such protocol (Humblet et 
al., 1992). 


A simple way to build an all-optical LAN is to use a passive star coupler (see Fig. 2-10). In 
effect, two fibers from each station are fused to a glass cylinder. One fiber is for output to the 
cylinder and one is for input from the cylinder. Light output by any station illuminates the 
cylinder and can be detected by all the other stations. Passive stars can handle hundreds of 
stations. 


To allow multiple transmissions at the same time, the spectrum is divided into channels 
(wavelength bands), as shown in Fig. 2-31. In this protocol, WDMA (Wavelength Division 
Multiple Access), each station is assigned two channels. A narrow channel is provided as a 
control channel to signal the station, and a wide channel is provided so the station can output 
data frames. 


Each channel is divided into groups of time slots, as shown in Fig. 4-10. Let us call the number 
of slots in the control channel m and the number of slots in the data channel n + 1, where n of 
these are for data and the last one is used by the station to report on its status (mainly, which 
slots on both channels are free). On both channels, the sequence of slots repeats endlessly, 
with slot 0 being marked in a special way so latecomers can detect it. All channels are 
synchronized by a single global clock. 


Figure 4-10. Wavelength division multiple access. 


The protocol supports three traffic classes: (1) constant data rate connection-oriented traffic, 
such as uncompressed video, (2) variable data rate connection-oriented traffic, such as file 
transfer, and (3) datagram traffic, such as UDP packets. For the two connection-oriented 
protocols, the idea is that for A to communicate with B, it must first insert a CONNECTION 
REQUEST frame in a free slot on B's control channel. If B accepts, communication can take 
place on A's data channel. 


Each station has two transmitters and two receivers, as follows: 


1. A fixed-wavelength receiver for listening to its own control channel. 
2. A tunable transmitter for sending on other stations' control channels. 
3. A fixed-wavelength transmitter for outputting data frames. 
4. A tunable receiver for selecting a data transmitter to listen to. 


In other words, every station listens to its own control channel for incoming requests but has 
to tune to the transmitter's wavelength to get the data. Wavelength tuning is done by a 
Fabry-Perot or Mach-Zehnder interferometer that filters out all wavelengths except the desired 
wavelength band. 


Let us now consider how station A sets up a class 2 communication channel with station B for, 
say, file transfer. First, A tunes its data receiver to B's data channel and waits for the status 
slot. This slot tells which control slots are currently assigned and which are free. In Fig. 4-10, 
for example, we see that of B's eight control slots, 0, 4, and 5 are free. The rest are occupied 
(indicated by crosses). 


A picks one of the free control slots, say, 4, and inserts its CONNECTION REQUEST message 
there. Since B constantly monitors its control channel, it sees the request and grants it by 
assigning slot 4 to A. This assignment is announced in the status slot of B's data channel. 
When A sees the announcement, it knows it has a unidirectional connection. If A asked for a 
two-way connection, B now repeats the same algorithm with A. 


It is possible that at the same time A tried to grab B's control slot 4, C did the same thing. 
Neither will get it, and both will notice the failure by monitoring the status slot in B's data 
channel. They now each wait a random amount of time and try again later. 
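The rule for competing connection requests can be sketched as follows. The code models only one cycle of B's control channel and only the rule just described: a request succeeds if it is the sole request for a free slot, and simultaneous requests for the same slot all fail. The random wait before a retry is not modeled, and the function and parameter names are hypothetical.

```python
def resolve_control_slots(free_slots: set[int],
                          requests: dict[str, int]) -> dict[int, str]:
    """Return {slot: station} for the requests that succeed in this cycle.

    free_slots: control slots that the status slot reports as free.
    requests: for each requesting station, the control slot it writes into.
    """
    by_slot: dict[int, list[str]] = {}
    for station, slot in requests.items():
        if slot in free_slots:                      # occupied slots cannot be taken
            by_slot.setdefault(slot, []).append(station)
    # A slot is granted only if exactly one station asked for it.
    return {slot: stations[0]
            for slot, stations in by_slot.items() if len(stations) == 1}


if __name__ == "__main__":
    free = {0, 4, 5}                                # as in the example of Fig. 4-10
    print(resolve_control_slots(free, {"A": 4, "C": 4}))   # both fail
    print(resolve_control_slots(free, {"A": 4, "C": 5}))   # both succeed
```

A station that does not appear in the result would wait a random amount of time and try again.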


At this point, each party has a conflict-free way to send short control messages to the other 
one. To perform the file transfer, A now sends B a control message saying, for example, 
"Please watch my next data output slot 3. There is a data frame for you in it." When B gets 
the control message, it tunes its receiver to A's output channel to read the data frame. 
Depending on the higher-layer protocol, B can use the same mechanism to send back an 
acknowledgement if it wishes. 


Note that a problem arises if both A and C have connections to B and each of them suddenly 
tells B to look at slot 3. B will pick one of these requests at random, and the other 
transmission will be lost. 


For constant rate traffic, a variation of this protocol is used. When A asks for a connection, it 
simultaneously says something like: "Is it all right if I send you a frame in every occurrence of 
slot 3?" If B is able to accept (i.e., has no previous commitment for slot 3), a guaranteed 
bandwidth connection is established. If not, A can try again with a different proposal, 
depending on which output slots it has free. 


Class 3 (datagram) traffic uses still another variation. Instead of writing a CONNECTION 
REQUEST message into the control slot it just found (4), A writes a DATA FOR YOU IN SLOT 3 
message. If B is free during the next data slot 3, the transmission will succeed. Otherwise, the 
data frame is lost. In this manner, no connections are ever needed. 


Several variants of the protocol are possible. For example, instead of each station having its 
own control channel, a single control channel can be shared by all stations. Each station is 
assigned a block of slots in each group, effectively multiplexing multiple virtual channels onto 
one physical one. 


It is also possible to make do with a single tunable transmitter and a single tunable receiver 
per station by having each station's channel be divided into m control slots followed by n + 1 
data slots. The disadvantage here is that senders have to wait longer to capture a control slot 
and consecutive data frames are farther apart because some control information is in the way. 


Numerous other WDMA protocols have been proposed and implemented, differing in various 
details. Some have only one control channel; others have multiple control channels. Some take 
propagation delay into account; others do not. Some make tuning time an explicit part of the 
model; others ignore it. The protocols also differ in terms of processing complexity, 
throughput, and scalability. When a large number of frequencies are being used, the system is 
sometimes called DWDM (Dense Wavelength Division Multiplexing). For more information 
see (Bogineni et al., 1993; Chen, 1994; Goralski, 2001; Kartalopoulos, 1999; and Levine and 
Akyildiz, 1995). 


4.1.6 Wireless LAN Protocols 


As the number of mobile computing and communication devices grows, so does the demand to 
connect them to the outside world. Even the very first mobile telephones had the ability to 
connect to other telephones. The first portable computers did not have this capability, but soon 
afterward, modems became commonplace on notebook computers. To go on-line, these 
computers had to be plugged into a telephone wall socket. Requiring a wired connection to the 
fixed network meant that the computers were portable, but not mobile. 


To achieve true mobility, notebook computers need to use radio (or infrared) signals for 
communication. In this manner, dedicated users can read and send e-mail while hiking or 
boating. A system of notebook computers that communicate by radio can be regarded as a 
wireless LAN, as we discussed in Sec. 1.5.4. These LANs have somewhat different properties 
than conventional LANs and require special MAC sublayer protocols. In this section we will 
examine some of these protocols. More information about wireless LANs can be found in 
(Geier, 2002; and O'Hara and Petrick, 1999). 


A common configuration for a wireless LAN is an office building with base stations (also called 
access points) strategically placed around the building. All the base stations are wired together 
using copper or fiber. If the transmission power of the base stations and notebooks is adjusted 
to have a range of 3 or 4 meters, then each room becomes a single cell and the entire building 
becomes a large cellular system, as in the traditional cellular telephony systems we studied in 
Chap. 2. Unlike cellular telephone systems, each cell has only one channel, covering the entire 
available bandwidth and covering all the stations in its cell. Typically, its bandwidth is 11 to 54 
Mbps. 


In our discussions below, we will make the simplifying assumption that all radio transmitters 
have some fixed range. When a receiver is within range of two active transmitters, the 
resulting signal will generally be garbled and useless; in other words, we will not consider 
CDMA-type systems further in this discussion. It is important to realize that in some wireless 
LANs, not all stations are within range of one another, which leads to a variety of 
complications. Furthermore, for indoor wireless LANs, the presence of walls between stations 
can have a major impact on the effective range of each station. 


A naive approach to using a wireless LAN might be to try CSMA: just listen for other 
transmissions and only transmit if no one else is doing so. The trouble is, this protocol is not 
really appropriate because what matters is interference at the receiver, not at the sender. To 
see the nature of the problem, consider Fig. 4-11, where four wireless stations are illustrated. 
For our purposes, it does not matter which are base stations and which are notebooks. The 
radio range is such that A and B are within each other's range and can potentially interfere 
with one another. C can also potentially interfere with both B and D, but not with A. 


Figure 4-11. A wireless LAN. (a) A transmitting. (b) B transmitting. 


First consider what happens when A is transmitting to B, as depicted in Fig. 4-11(a). If C 
senses the medium, it will not hear A because A is out of range, and thus falsely conclude that 
it can transmit to B. If C does start transmitting, it will interfere at B, wiping out the frame 
from A. The problem of a station not being able to detect a potential competitor for the 
medium because the competitor is too far away is called the hidden station problem. 


Now let us consider the reverse situation: B transmitting to A, as shown in Fig. 4-11(b). If C 
senses the medium, it will hear an ongoing transmission and falsely conclude that it may not 
send to D, when in fact such a transmission would cause bad reception only in the zone 
between B and C, where neither of the intended receivers is located. This is called the 
exposed station problem. 


The problem is that before starting a transmission, a station really wants to know whether 
there is activity around the receiver. CSMA merely tells it whether there is activity around the 
station sensing the carrier. With a wire, all signals propagate to all stations so only one 
transmission can take place at once anywhere in the system. In a system based on short- 
range radio waves, multiple transmissions can occur simultaneously if they all have different 
destinations and these destinations are out of range of one another. 
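
The hidden and exposed station problems can be reproduced with a toy model. The sketch below assumes the four stations of Fig. 4-11 sit on a line at made-up positions and share one fixed radio range, in line with the simplifying assumption stated earlier. The names POS, RANGE, in_range, carrier_sense_allows, and safe_at_receiver are illustrative only.

```python
# Toy model of Fig. 4-11: four stations on a line with a fixed radio range.
# Positions and range are assumed values, chosen so that A-B, B-C and C-D
# are neighbors while A and C cannot hear each other.
POS = {"A": 0, "B": 1, "C": 2, "D": 3}
RANGE = 1.5

def in_range(x, y):
    """True if stations x and y can hear each other."""
    return abs(POS[x] - POS[y]) <= RANGE

def carrier_sense_allows(sender, active):
    """CSMA decision: send only if no active transmitter is audible at the sender."""
    return not any(in_range(sender, t) for t in active)

def safe_at_receiver(receiver, active):
    """What actually matters: no active transmitter is audible at the receiver."""
    return not any(in_range(receiver, t) for t in active)

# Hidden station: A is sending to B and C wants to send to B.
# CSMA says "go", yet B is already busy receiving from A.
hidden = carrier_sense_allows("C", ["A"]) and not safe_at_receiver("B", ["A"])

# Exposed station: B is sending to A and C wants to send to D.
# CSMA says "wait", yet D cannot hear B at all.
exposed = (not carrier_sense_allows("C", ["B"])) and safe_at_receiver("D", ["B"])

print(hidden, exposed)
```

In both cases the carrier-sense answer at the sender disagrees with the condition at the receiver, which is the mismatch the text describes.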


Another way to think about this problem is to imagine an office building in which every 
employee has a wireless notebook computer. Suppose that Linda wants to send a message to 
Milton. Linda's computer senses the local environment and, detecting no activity, starts 
sending. However, there may still be a collision in Milton's office because a third party may 
currently be sending to him from a location so far from Linda that her computer could not 
detect it. 


MACA and MACAW 


An early protocol designed for wireless LANs is MACA (Multiple Access with Collision 
Avoidance) (Karn, 1990). The basic idea behind it is for the sender to stimulate the receiver 
into outputting a short frame, so stations nearby can detect this transmission and avoid 
transmitting for the duration of the upcoming (large) data frame. MACA is illustrated in 
Fig. 4-12. 


Figure 4-12. The MACA protocol. (a) A sending an RTS to B. (b) B 
responding with a CTS to A. 


Let us now consider how A sends a frame to B. A starts by sending an RTS (Request To 
Send) frame to B, as shown in Fig. 4-12(a). This short frame (30 bytes) contains the length of 
the data frame that will eventually follow. Then B replies with a CTS (Clear to Send) frame, 
as shown in Fig. 4-12(b). The CTS frame contains the data length (copied from the RTS 
frame). Upon receipt of the CTS frame, A begins transmission. 


Now let us see how stations overhearing either of these frames react. Any station hearing the 
RTS is clearly close to A and must remain silent long enough for the CTS to be transmitted 
back to A without conflict. Any station hearing the CTS is clearly close to B and must remain 
silent during the upcoming data transmission, whose length it can tell by examining the CTS 
frame. 


In Fig. 4-12, C is within range of A but not within range of B. Therefore, it hears the RTS from 
A but not the CTS from B. As long as it does not interfere with the CTS, it is free to transmit 
while the data frame is being sent. In contrast, D is within range of B but not A. It does not 
hear the RTS but does hear the CTS. Hearing the CTS tips it off that it is close to a station that 
is about to receive a frame, so it defers sending anything until that frame is expected to be 
finished. Station E hears both control messages and, like D, must be silent until the data frame 
is complete. 
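
The deferral rules for bystanders can be written down compactly. The following is a minimal sketch, not a protocol implementation: the audibility sets are taken from the description of Fig. 4-12 above, and the names HEARS and maca_reaction are hypothetical.

```python
# Which stations are within range of the sender A and the receiver B,
# as described for Fig. 4-12.
HEARS = {
    "A": {"B", "C", "E"},   # stations that hear A's RTS
    "B": {"A", "D", "E"},   # stations that hear B's CTS
}

def maca_reaction(station, sender="A", receiver="B"):
    """Return how a bystander must behave during an RTS/CTS/data exchange."""
    hears_rts = station in HEARS[sender]
    hears_cts = station in HEARS[receiver]
    if hears_cts:
        # Close to the receiver: stay quiet for the whole data frame.
        return "silent until data frame ends"
    if hears_rts:
        # Close to the sender only: stay quiet just long enough for the CTS.
        return "silent until CTS has passed"
    return "unconstrained"

reactions = {s: maca_reaction(s) for s in "CDE"}
print(reactions)
```

Hearing the CTS is the stronger condition, which is why it is tested first: a station such as E that hears both frames is treated exactly like D.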


Despite these precautions, collisions can still occur. For example, B and C could both send RTS 
frames to A at the same time. These will collide and be lost. In the event of a collision, an 
unsuccessful transmitter (i.e., one that does not hear a CTS within the expected time interval) 
waits a random amount of time and tries again later. The algorithm used is binary exponential 
backoff, which we will study when we come to Ethernet. 


Based on simulation studies of MACA, Bharghavan et al. (1994) fine-tuned MACA to improve its 
performance and renamed their new protocol MACAW (MACA for Wireless). To start with, 
they noticed that without data link layer acknowledgements, lost frames were not 
retransmitted until the transport layer noticed their absence, much later. They solved this 
problem by introducing an ACK frame after each successful data frame. They also observed 
that CSMA has some use, namely, to keep a station from transmitting an RTS at the same 
time another nearby station is also doing so to the same destination, so carrier sensing was 
added. In addition, they decided to run the backoff algorithm separately for each data stream 
(source-destination pair), rather than for each station. This change improves the fairness of 
the protocol. Finally, they added a mechanism for stations to exchange information about 
congestion and a way to make the backoff algorithm react less violently to temporary 
problems, to improve system performance. 


4.2 Ethernet 


We have now finished our general discussion of channel allocation protocols in the abstract, so 
it is time to see how these principles apply to real systems, in particular, LANs. As discussed in 
Sec. 1.5.3, the IEEE has standardized a number of local area networks and metropolitan area 
networks under the name of IEEE 802. A few have survived but many have not, as we saw in 
Fig. 1-38. Some people who believe in reincarnation think that Charles Darwin came back as a 
member of the IEEE Standards Association to weed out the unfit. The most important of the 
survivors are 802.3 (Ethernet) and 802.11 (wireless LAN). With 802.15 (Bluetooth) and 
802.16 (wireless MAN), it is too early to tell. Please consult the 5th edition of this book to find 
out. Both 802.3 and 802.11 have different physical layers and different MAC sublayers but 
converge on the same logical link control sublayer (defined in 802.2), so they have the same 
interface to the network layer. 


We introduced Ethernet in Sec. 1.5.3 and will not repeat that material here. Instead we will 
focus on the technical details of Ethernet, the protocols, and recent developments in 
high-speed (gigabit) Ethernet. Since Ethernet and IEEE 802.3 are identical except for two minor 
differences that we will discuss shortly, many people use the terms "Ethernet" and "IEEE 
802.3" interchangeably, and we will do so, too. For more information about Ethernet, see 
(Breyer and Riley, 1999; Seifert, 1998; and Spurgeon, 2000). 


4.2.1 Ethernet Cabling 


Since the name "Ethernet" refers to the cable (the ether), let us start our discussion there. 
Four types of cabling are commonly used, as shown in Fig. 4-13. 


Figure 4-13. The most common kinds of Ethernet cabling. 


Name       Cable          Max. seg.   Nodes/seg.   Advantages 
10Base5    Thick coax     500 m       100          Original cable; now obsolete 
10Base2    Thin coax      185 m       30           No hub needed 
10Base-T   Twisted pair   100 m       1024         Cheapest system 
10Base-F   Fiber optics   2000 m      1024         Best between buildings 


Historically, 10Base5 cabling, popularly called thick Ethernet, came first. It resembles a 
yellow garden hose, with markings every 2.5 meters to show where the taps go. (The 802.3 
standard does not actually require the cable to be yellow, but it does suggest it.) Connections 
to it are generally made using vampire taps, in which a pin is very carefully forced halfway 
into the coaxial cable's core. The notation 10Base5 means that it operates at 10 Mbps, uses 
baseband signaling, and can support segments of up to 500 meters. The first number is the 
speed in Mbps. Then comes the word "Base" (or sometimes "BASE") to indicate baseband 
transmission. There used to be a broadband variant, 10Broad36, but it never caught on in the 
marketplace and has since vanished. Finally, if the medium is coax, its length is given rounded 
to units of 100 m after "Base." 


Historically, the second cable type was 10Base2, or thin Ethernet, which, in contrast to the 
garden-hose-like thick Ethernet, bends easily. Connections to it are made using industry- 
standard BNC connectors to form T junctions, rather than using vampire taps. BNC connectors 
are easier to use and more reliable. Thin Ethernet is much cheaper and easier to install, but it 
can run for only 185 meters per segment, each of which can handle only 30 machines. 


Detecting cable breaks, excessive length, bad taps, or loose connectors can be a major 
problem with both media. For this reason, techniques have been developed to track them 
down. Basically, a pulse of known shape is injected into the cable. If the pulse hits an obstacle 
or the end of the cable, an echo will be generated and sent back. By carefully timing the 
interval between sending the pulse and receiving the echo, it is possible to localize the origin of 
the echo. This technique is called time domain reflectometry. 
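
The arithmetic behind time domain reflectometry is a single line: the pulse travels to the fault and back, so the distance is half the echo delay times the propagation speed. The sketch below assumes a velocity factor of 0.66 (signal speed as a fraction of the speed of light), which is an illustrative value and not a figure from the text. The function name fault_distance is hypothetical.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s

def fault_distance(echo_delay_s, velocity_factor=0.66):
    """Distance in meters to the point that produced the echo.

    echo_delay_s is the time between injecting the pulse and seeing the echo.
    velocity_factor is an assumed cable property, not taken from the text.
    """
    return velocity_factor * C_VACUUM * echo_delay_s / 2.0

# An echo arriving 1 microsecond after the pulse was sent.
d = fault_distance(1e-6)
print(f"{d:.1f} m")
```

The factor of one half is the part that is easy to forget, since the measured interval covers the round trip.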


The problems associated with finding cable breaks drove systems toward a different kind of 
wiring pattern, in which all stations have a cable running to a central hub in which they are all 
connected electrically (as if they were soldered together). Usually, these wires are telephone 
company twisted pairs, since most office buildings are already wired this way, and normally 
plenty of spare pairs are available. This scheme is called 10Base-T. Hubs do not buffer 
incoming traffic. Later in this chapter we will discuss an improved version of this idea 
(switches), which do buffer incoming traffic. 


These three wiring schemes are illustrated in Fig. 4-14. For 10Base5, a transceiver is 
clamped securely around the cable so that its tap makes contact with the inner core. The 
transceiver contains the electronics that handle carrier detection and collision detection. When 
a collision is detected, the transceiver also puts a special invalid signal on the cable to ensure 
that all other transceivers also realize that a collision has occurred. 


Figure 4-14. Three kinds of Ethernet cabling. (a) 10Base5. (b) 10Base2. (c) 10Base-T. 


With 10Base5, a transceiver cable or drop cable connects the transceiver to an interface 
board in the computer. The transceiver cable may be up to 50 meters long and contains five 
individually shielded twisted pairs. Two of the pairs are for data in and data out, respectively. 
Two more are for control signals in and out. The fifth pair, which is not always used, allows the 
computer to power the transceiver electronics. Some transceivers allow up to eight nearby 
computers to be attached to them, to reduce the number of transceivers needed. 


The transceiver cable terminates on an interface board inside the computer. The interface 
board contains a controller chip that transmits frames to, and receives frames from, the 
transceiver. The controller is responsible for assembling the data into the proper frame format, 
as well as computing checksums on outgoing frames and verifying them on incoming frames. 


Some controller chips also manage a pool of buffers for incoming frames, a queue of buffers to 
be transmitted, direct memory transfers with the host computers, and other aspects of 
network management. 


With 10Base2, the connection to the cable is just a passive BNC T-junction connector. The 
transceiver electronics are on the controller board, and each station always has its own 
transceiver. 


With 10Base-T, there is no shared cable at all, just the hub (a box full of electronics) to which 
each station is connected by a dedicated (i.e., not shared) cable. Adding or removing a station 
is simpler in this configuration, and cable breaks can be detected easily. The disadvantage of 
10Base-T is that the maximum cable run from the hub is only 100 meters, maybe 200 meters 
if very high quality category 5 twisted pairs are used. Nevertheless, 10Base-T quickly became 
dominant due to its use of existing wiring and the ease of maintenance that it offers. A faster 
version of 10Base-T (100Base-T) will be discussed later in this chapter. 


A fourth cabling option for Ethernet is 10Base-F, which uses fiber optics. This alternative is 
expensive due to the cost of the connectors and terminators, but it has excellent noise 
immunity and is the method of choice when running between buildings or widely-separated 
hubs. Runs of up to 2 km are allowed (see Fig. 4-13). It also offers good security since wiretapping fiber is much 
more difficult than wiretapping copper wire. 


Figure 4-15 shows different ways of wiring a building. In Fig. 4-15(a), a single cable is snaked 
from room to room, with each station tapping into it at the nearest point. In Fig. 4-15(b), a 
vertical spine runs from the basement to the roof, with horizontal cables on each floor 
connected to the spine by special amplifiers (repeaters). In some buildings, the horizontal 
cables are thin and the backbone is thick. The most general topology is the tree, as in 
Fig. 4-15(c), because a network with two paths between some pairs of stations would suffer from 
interference between the two signals. 


Figure 4-15. Cable topologies. (a) Linear. (b) Spine. (c) Tree. (d) Segmented. 


Each version of Ethernet has a maximum cable length per segment. To allow larger networks, 
multiple cables can be connected by repeaters, as shown in Fig. 4-15(d). A repeater is a 
physical layer device. It receives, amplifies (regenerates), and retransmits signals in both 
directions. As far as the software is concerned, a series of cable segments connected by 
repeaters is no different from a single cable (except for some delay introduced by the 
repeaters). A system may contain multiple cable segments and multiple repeaters, but no two 
transceivers may be more than 2.5 km apart and no path between any two transceivers may 
traverse more than four repeaters. 


4.2.2 Manchester Encoding 


None of the versions of Ethernet uses straight binary encoding with 0 volts for a 0 bit and 5 
volts for a 1 bit because it leads to ambiguities. If one station sends the bit string 0001000, 
others might falsely interpret it as 10000000 or 01000000 because they cannot tell the 
difference between an idle sender (0 volts) and a 0 bit (0 volts). This problem can be solved by 
using +1 volts for a 1 and -1 volts for a 0, but there is still the problem of a receiver sampling 
the signal at a slightly different frequency than the sender used to generate it. Different clock 
speeds can cause the receiver and sender to get out of synchronization about where the bit 
boundaries are, especially after a long run of consecutive 0s or a long run of consecutive 1s. 


What is needed is a way for receivers to unambiguously determine the start, end, or middle of 
each bit without reference to an external clock. Two such approaches are called Manchester 
encoding and differential Manchester encoding. With Manchester encoding, each bit 
period is divided into two equal intervals. A binary 1 bit is sent by having the voltage set high 
during the first interval and low in the second one. A binary 0 is just the reverse: first low and 
then high. This scheme ensures that every bit period has a transition in the middle, making it 
easy for the receiver to synchronize with the sender. A disadvantage of Manchester encoding is 
that it requires twice as much bandwidth as straight binary encoding because the pulses are 
half the width. For example, to send data at 10 Mbps, the signal has to change 20 million 
times/sec. Manchester encoding is shown in Fig. 4-16(b). 
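
The encoding rule just described fits in a few lines. This is a minimal sketch under stated assumptions: bits are given as a string of '0' and '1' characters, each half-interval is represented as 1 (high) or 0 (low), and the polarity is the one described above (other references use the opposite polarity). The function names are hypothetical.

```python
def manchester_encode(bits):
    """Encode a bit string: 1 -> (high, low), 0 -> (low, high)."""
    halves = []
    for b in bits:
        if b not in "01":
            raise ValueError("bits must contain only '0' and '1'")
        halves.extend((1, 0) if b == "1" else (0, 1))
    return halves

def manchester_decode(halves):
    """Recover the bit string by comparing the two halves of each bit period."""
    pairs = zip(halves[0::2], halves[1::2])
    return "".join("1" if first > second else "0" for first, second in pairs)

signal = manchester_encode("0001000")
print(signal)
print(manchester_decode(signal))
```

The output has two half-intervals per bit, which is the doubling of the signaling rate noted above, and every pair differs in the middle, so even a long run of 0s gives the receiver a transition in every bit period.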


Figure 4-16. (a) Binary encoding. (b) Manchester encoding. (c) Differential Manchester encoding. 


Differential Manchester encoding, shown in Fig. 4-16(c), is a variation of basic Manchester 
encoding. In it, a 1 bit is indicated by the absence of a transition at the start of the interval. A 
0 bit is indicated by the presence of a transition at the start of the interval. In both cases, 
there is a transition in the middle as well. The differential scheme requires more complex 
equipment but offers better noise immunity. All Ethernet systems use Manchester encoding 
due to its simplicity. The high signal is +0.85 volts and the low signal is -0.85 volts, giving a 
DC value of 0 volts. Ethernet does not use differential Manchester encoding, but other LANs 
(e.g., the 802.5 token ring) do use it. 
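
For comparison, the differential rule can be sketched the same way. The code assumes bits are a string of '0' and '1' characters, levels are 1 (high) or 0 (low), and the line level just before the first bit is supplied as initial_level. That parameter and the function names are illustrative choices, not part of any standard.

```python
def diff_manchester_encode(bits, initial_level=1):
    """0 -> transition at the start of the bit; 1 -> none. Always one mid-bit."""
    level = initial_level          # line level at the end of the previous bit
    halves = []
    for b in bits:
        if b == "0":
            level ^= 1             # transition at the start of the interval
        halves.append(level)       # first half of the bit period
        level ^= 1                 # mandatory transition in the middle
        halves.append(level)       # second half of the bit period
    return halves

def diff_manchester_decode(halves, initial_level=1):
    """A bit is 1 if its first half equals the level that preceded it."""
    prev = initial_level
    bits = []
    for i in range(0, len(halves), 2):
        bits.append("1" if halves[i] == prev else "0")
        prev = halves[i + 1]
    return "".join(bits)

encoded = diff_manchester_encode("1011")
inverted = [1 - h for h in encoded]            # same signal with polarity swapped
print(diff_manchester_decode(encoded))
print(diff_manchester_decode(inverted, initial_level=0))
```

Because only the presence or absence of transitions carries information, the inverted signal decodes to the same bits, which is one way to see why the scheme tolerates a swapped wire pair.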


4.2.3 The Ethernet MAC Sublayer Protocol 


The original DIX (DEC, Intel, Xerox) frame structure is shown in Fig. 4-17(a). Each frame 
starts with a Preamble of 8 bytes, each containing the bit pattern 10101010. The Manchester 
encoding of this pattern produces a 10-MHz square wave for 6.4 μsec to allow the receiver's 
clock to synchronize with the sender's. They are required to stay synchronized for the rest of 
the frame, using the Manchester encoding to keep track of the bit boundaries. 


Figure 4-17. Frame formats. (a) DIX Ethernet. (b) IEEE 802.3. 


The frame contains two addresses, one for the destination and one for the source. The 
standard allows 2-byte and 6-byte addresses, but the parameters defined for the 10-Mbps 
baseband standard use only the 6-byte addresses. The high-order bit of the destination 
address is a 0 for ordinary addresses and 1 for group addresses. Group addresses allow 
multiple stations to listen to a single address. When a frame is sent to a group address, all the 
stations in the group receive it. Sending to a group of stations is called multicast. The address 
consisting of all 1 bits is reserved for broadcast. A frame containing all 1s in the destination 
field is accepted by all stations on the network. The difference between multicast and 
broadcast is important enough to warrant repeating. A multicast frame is sent to a selected 
group of stations on the Ethernet; a broadcast frame is sent to all stations on the Ethernet. 
Multicast is more selective, but involves group management. Broadcasting is coarser but does 
not require any group management. 


Another interesting feature of the addressing is the use of bit 46 (adjacent to the high-order 
bit) to distinguish local from global addresses. Local addresses are assigned by each network 
administrator and have no significance outside the local network. Global addresses, in 
contrast, are assigned centrally by IEEE to ensure that no two stations anywhere in the world 
have the same global address. With 48 - 2 = 46 bits available, there are about 7 × 10^13 global 
addresses. The idea is that any station can uniquely address any other station by just giving 
the right 48-bit number. It is up to the network layer to figure out how to locate the 
destination. 
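
The two flag bits are easy to test in software, with one caveat about notation. In the colon-separated hexadecimal form that most tools print, the bit the text calls high-order (the first one sent on the wire) shows up as the least significant bit of the first byte, and the local/global bit is the one next to it. The sketch below works under that assumption. classify_address is a hypothetical helper and the sample addresses are arbitrary.

```python
def classify_address(mac):
    """Classify a 48-bit address written as six colon-separated hex bytes."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-byte address")
    if octets == b"\xff" * 6:
        kind = "broadcast"                 # all 1 bits
    elif octets[0] & 0x01:
        kind = "multicast"                 # group bit set
    else:
        kind = "unicast"                   # ordinary address
    scope = "local" if octets[0] & 0x02 else "global"
    return kind, scope

GLOBAL_ADDRESSES = 2 ** 46                 # 48 bits minus the two flag bits

print(classify_address("00:1a:2b:3c:4d:5e"))
print(classify_address("01:00:5e:00:00:01"))
print(classify_address("ff:ff:ff:ff:ff:ff"))
print(f"{GLOBAL_ADDRESSES:.2e}")
```

The last line evaluates 2^46, which is where the figure of roughly 7 × 10^13 global addresses comes from.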


Next comes the Type field, which tells the receiver what to do with the frame. Multiple 
network-layer protocols may be in use at the same time on the same machine, so when an 
Ethernet frame arrives, the kernel has to know which one to hand the frame to. The Type field 
specifies which process to give the frame to. 


Next come the data, up to 1500 bytes. This limit was chosen somewhat arbitrarily at the time 
the DIX standard was cast in stone, mostly based on the fact that a transceiver needs enough 
RAM to hold an entire frame and RAM was expensive in 1978. A larger upper limit would have 
meant more RAM, hence a more expensive transceiver. 


In addition to there being a maximum frame length, there is also a minimum frame length. 
While a data field of 0 bytes is sometimes useful, it causes a problem. When a transceiver 
detects a collision, it truncates the current frame, which means that stray bits and pieces of 
frames appear on the cable all the time. To make it easier to distinguish valid frames from 
garbage, Ethernet requires that valid frames must be at least 64 bytes long, from destination 
address to checksum, including both. If the data portion of a frame is less than 46 bytes, the 
Pad field is used to fill out the frame to the minimum size. 
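
The 46-byte threshold follows from the field sizes: 64 bytes minus two 6-byte addresses, the 2-byte Type or Length field, and the 4-byte checksum. A minimal sketch of the padding rule, with hypothetical constant and function names:

```python
MIN_FRAME = 64            # bytes, destination address through checksum
HEADER = 6 + 6 + 2        # destination, source, Type/Length
CHECKSUM = 4
MIN_DATA = MIN_FRAME - HEADER - CHECKSUM   # smallest data field that needs no pad

def pad_length(data_len):
    """Number of Pad bytes needed for a data field of data_len bytes."""
    return max(0, MIN_DATA - data_len)

for n in (0, 10, 46, 1500):
    print(n, pad_length(n))
```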


Another (and more important) reason for having a minimum length frame is to prevent a 
station from completing the transmission of a short frame before the first bit has even reached 
the far end of the cable, where it may collide with another frame. This problem is illustrated in 
Fig. 4-18. At time 0, station A, at one end of the network, sends off a frame. Let us call the 
propagation time for this frame to reach the other end τ. Just before the frame gets to the 
other end (i.e., at time τ - ε), the most distant station, B, starts transmitting. When B detects 
that it is receiving more power than it is putting out, it knows that a collision has occurred, so 
it aborts its transmission and generates a 48-bit noise burst to warn all other stations. In other 
words, it jams the ether to make sure the sender does not miss the collision. At about time 2τ, 
the sender sees the noise burst and aborts its transmission, too. It then waits a random time 
before trying again. 


Figure 4-18. Collision detection can take as long as 2τ. 


If a station tries to transmit a very short frame, it is conceivable that a collision occurs, but the 
transmission completes before the noise burst gets back at 2τ. The sender will then incorrectly 
conclude that the frame was successfully sent. To prevent this situation from occurring, all 
frames must take more than 2τ to send so that the transmission is still taking place when the 
noise burst gets back to the sender. For a 10-Mbps LAN with a maximum length of 2500 
meters and four repeaters (from the 802.3 specification), the round-trip time has been 
determined to be nearly 50 μsec in the worst case, including the time to pass through the 
four repeaters, which is most certainly not zero. 
Therefore, the minimum frame must take at least this long to transmit. At 10 Mbps, a bit takes 
100 nsec, so 500 bits is the smallest frame that is guaranteed to work. To add some margin of 
safety, this number was rounded up to 512 bits or 64 bytes. Frames with fewer than 64 bytes 
are padded out to 64 bytes with the Pad field. 


As the network speed goes up, the minimum frame length must go up or the maximum cable 
length must come down, proportionally. For a 2500-meter LAN operating at 1 Gbps, the 
minimum frame size would have to be 6400 bytes. Alternatively, the minimum frame size 
could be 640 bytes and the maximum distance between any two stations 250 meters. These 
restrictions are becoming increasingly painful as we move toward multigigabit networks. 
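
The numbers in the last two paragraphs all come from one relation: the minimum frame must last at least one round-trip time, so its length in bits is the data rate times 2τ. The sketch below uses the 51.2 μsec figure implied by the 512-bit minimum at 10 Mbps and makes the simplifying assumption that 2τ scales linearly with the maximum network length. The names are hypothetical.

```python
SLOT_TIME_S = 51.2e-6      # 512 bit times at 10 Mbps on a 2500 m network
REFERENCE_LENGTH_M = 2500

def min_frame_bytes(rate_bps, max_length_m=REFERENCE_LENGTH_M):
    """Smallest frame that is still being sent when a far-end collision returns."""
    # Assumption: the round-trip time shrinks in proportion to the cable length.
    round_trip = SLOT_TIME_S * max_length_m / REFERENCE_LENGTH_M
    return round(rate_bps * round_trip / 8)

print(min_frame_bytes(10e6))           # 10 Mbps, 2500 m
print(min_frame_bytes(1e9))            # 1 Gbps, 2500 m
print(min_frame_bytes(1e9, 250))       # 1 Gbps, 250 m
```

Raising the rate by a factor of 100 raises the minimum frame by the same factor, and cutting the distance by a factor of 10 brings it back down by 10.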


The final Ethernet field is the Checksum. It is effectively a 32-bit hash code of the data. If 
some data bits are erroneously received (due to noise on the cable), the checksum will almost 
certainly be wrong and the error will be detected. The checksum algorithm is a cyclic 
redundancy check (CRC) of the kind discussed in Chap. 3. It just does error detection, not 
forward error correction. 
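
The detection property is easy to observe with Python's zlib.crc32, which uses the same 32-bit CRC polynomial as Ethernet. The sketch is illustrative only: the frame contents are arbitrary stand-in bytes, and details such as on-the-wire bit ordering and exactly which fields the checksum covers are ignored.

```python
import zlib

frame = bytes(range(64))                 # arbitrary stand-in for frame contents
checksum = zlib.crc32(frame)             # 32-bit value the sender would append

corrupted = bytearray(frame)
corrupted[10] ^= 0x04                    # flip a single bit, as a noise spike might

# The receiver recomputes the CRC and compares it with the transmitted one.
error_detected = zlib.crc32(bytes(corrupted)) != checksum
print(hex(checksum), error_detected)
```

A mismatch tells the receiver only that the frame is damaged. Nothing in the checksum says which bit changed, which is why it supports detection but not correction.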


When IEEE standardized Ethernet, the committee made two changes to the DIX format, as 
shown in Fig. 4-17(b). The first one was to reduce the preamble to 7 bytes and use the last 
byte for a Start of Frame delimiter, for compatibility with 802.4 and 802.5. The second one 
was to change the Type field into a Length field. Of course, now there was no way for the 
receiver to figure out what to do with an incoming frame, but that problem was handled by the 
addition of a small header to the data portion itself to provide this information. We will discuss 
the format of the data portion when we come to logical link control later in this chapter. 


Unfortunately, by the time 802.3 was published, so much hardware and software for DIX 
Ethernet was already in use that few manufacturers and users were enthusiastic about 
converting the Type field into a Length field. In 1997 IEEE threw in the towel and said that 
both ways were fine with it. Fortunately, all the Type fields in use before 1997 were greater 
than 1500. Consequently, any number there less than or equal to 1500 can be interpreted as 
Length, and any number greater than 1500 can be interpreted as Type. Now IEEE can 
maintain that everyone is using its standard and everybody else can keep on doing what they 
were already doing without feeling guilty about it. 
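
The rule for telling the two interpretations apart can be sketched as follows, using the threshold of 1500 exactly as stated above. interpret_field is a hypothetical name, and 0x0800 is used as an example because it is the well-known Type value for IP.

```python
def interpret_field(value):
    """Decide whether the 16-bit field after the addresses is a Length or a Type."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("the field is 16 bits wide")
    if value <= 1500:
        return ("length", value)     # 802.3 style: number of data bytes
    return ("type", value)           # DIX style: which protocol gets the frame

print(interpret_field(46))
print(interpret_field(1500))
print(interpret_field(0x0800))
```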


4.2.4 The Binary Exponential Backoff Algorithm 


Let us now see how randomization is done when a collision occurs. The model is that of 
Fig. 4-5. After a collision, time is divided into discrete slots whose length is equal to the worst-case 
round-trip propagation time on the ether (2τ). To accommodate the longest path allowed by 
Ethernet, the slot time has been set to 512 bit times, or 51.2 μsec, as mentioned above. 


After the first collision, each station waits either 0 or 1 slot times before trying again. If two 
stations collide and each one picks the same random number, they will collide again. After the 
second collision, each one picks either 0, 1, 2, or 3 at random and waits that number of slot 
times. If a third collision occurs (the probability of this happening is 0.25), then the next time 
the number of slots to wait is chosen at random from the interval 0 to 2^3 - 1. 


In general, after i collisions, a random number between 0 and 2^i - 1 is chosen, and that 
number of slots is skipped. However, after ten collisions have been reached, the randomization 
interval is frozen at a maximum of 1023 slots. After 16 collisions, the controller throws in the 
towel and reports failure back to the computer. Further recovery is up to higher layers. 
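
The rules above translate almost directly into code. This is a minimal sketch of the slot-selection step only, not of a controller: backoff_slots is a hypothetical name, and the optional rng argument exists just so that a seeded generator can be passed in for repeatable experiments.

```python
import random

MAX_EXPONENT = 10      # interval is frozen at 0..1023 from the tenth collision on
MAX_COLLISIONS = 16    # after this many collisions the controller gives up

def backoff_slots(collisions, rng=random):
    """Number of slot times to wait after the given number of collisions."""
    if collisions >= MAX_COLLISIONS:
        raise RuntimeError("too many collisions; report failure to the computer")
    exponent = min(collisions, MAX_EXPONENT)
    return rng.randrange(2 ** exponent)     # uniform over 0 .. 2^exponent - 1

for i in (1, 2, 3, 10, 15):
    print(i, backoff_slots(i))
```

After the first collision the result is 0 or 1, after the second it lies in 0 to 3, and from the tenth collision on it never exceeds 1023.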


This algorithm, called binary exponential backoff, was chosen to dynamically adapt to the 
number of stations trying to send. If the randomization interval for all collisions was 1023, the 
chance of two stations colliding for a second time would be negligible, but the average wait 
after a collision would be hundreds of slot times, introducing significant delay. On the other 
hand, if each station always delayed for either zero or one slots, then if 100 stations ever tried 
to send at once, they would collide over and over until 99 of them picked 1 and the remaining 
station picked 0. This might take years. By having the randomization interval grow 
exponentially as more and more consecutive collisions occur, the algorithm ensures a low 
delay when only a few stations collide but also ensures that the collision is resolved in a 
reasonable interval when many stations collide. Truncating the backoff at 1023 keeps the 
bound from growing too large. 


As described so far, CSMA/CD provides no acknowledgements. Since the mere absence of 
collisions does not guarantee that bits were not garbled by noise spikes on the cable, for 
reliable communication the destination must verify the checksum, and if correct, send back an 
acknowledgement frame to the source. Normally, this acknowledgement would be just another 
frame as far as the protocol is concerned and would have to fight for channel time just like a 
data frame. However, a simple modification to the contention algorithm would allow speedy 
confirmation of frame receipt (Tokoro and Tamaru, 1977). All that would be needed is to 
reserve the first contention slot following successful transmission for the destination station. 
Unfortunately, the standard does not provide for this possibility. 


4.2.5 Ethernet Performance 


Now let us briefly examine the performance of Ethernet under conditions of heavy and 
constant load, that is, k stations always ready to transmit. A rigorous analysis of the binary 
exponential backoff algorithm is complicated. Instead, we will follow Metcalfe and Boggs 
(1976) and assume a constant retransmission probability in each slot. If each station transmits 
during a contention slot with probability p, the probability A that some station acquires the 
channel in that slot is 


Equation 4-5 


A = kp(1 - p)^{k-1} 


A is maximized when p = 1/k, with A → 1/e as k → ∞. The probability that the contention 
interval has exactly j slots in it is A(1 - A)^{j-1}, so the mean number of slots per contention is 
given by 


\sum_{j=0}^{\infty} jA(1 - A)^{j-1} = \frac{1}{A} 


Since each slot has a duration 2τ, the mean contention interval, w, is 2τ/A. Assuming optimal 
p, the mean number of contention slots is never more than e, so w is at most 2τe ≈ 5.4τ. 
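
A quick numerical check makes the limiting behavior concrete. The sketch below evaluates the acquisition probability for the optimal choice p = 1/k and the resulting mean number of contention slots, 1/A. The function names are hypothetical.

```python
import math

def acquire_probability(k, p=None):
    """Probability that exactly one of k stations transmits in a slot."""
    p = 1.0 / k if p is None else p          # default to the optimal p = 1/k
    return k * p * (1 - p) ** (k - 1)

def mean_contention_slots(k):
    """Mean number of slots per contention interval, 1/A, at optimal p."""
    return 1.0 / acquire_probability(k)

for k in (2, 5, 10, 100, 1000):
    print(k, round(acquire_probability(k), 4), round(mean_contention_slots(k), 3))
print("1/e =", round(1 / math.e, 4), " e =", round(math.e, 3))
```

The probability falls from 0.5 at k = 2 toward 1/e, so the mean number of slots rises toward e but never passes it, which is the bound used for w above.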


If the mean frame takes P sec to transmit, when many stations have frames to send, 


Equation 4-6 


\text{Channel efficiency} = \frac{P}{P + 2\tau/A} 

Here we see where the maximum cable distance between any two stations enters into the 
performance figures, giving rise to topologies other than that of Fig. 4-15(a). The longer the 
cable, the longer the contention interval. This observation is why the Ethernet standard 
specifies a maximum cable length. 


It is instructive to formulate Eq. (4-6) in terms of the frame length, F, the network bandwidth, 
B, the cable length, L, and the speed of signal propagation, c, for the optimal case of e 
contention slots per frame. With P = F/B, Eq. (4-6) becomes 


Equation 4-7 


\text{Channel efficiency} = \frac{1}{1 + 2BLe/cF} 


When the second term in the denominator is large, network efficiency will be low. More 
specifically, increasing network bandwidth or distance (the BL product) reduces efficiency for a 
given frame size. Unfortunately, much research on network hardware is aimed precisely at 
increasing this product. People want high bandwidth over long distances (fiber optic MANs, for 
example), which suggests that Ethernet implemented in this manner may not be the best 
system for these applications. We will see other ways of implementing Ethernet when we come 
to switched Ethernet later in this chapter. 


In Fig. 4-19, the channel efficiency is plotted versus number of ready stations for 2τ = 51.2 µsec 
and a data rate of 10 Mbps, using Eq. (4-7). With a 64-byte slot time, it is not surprising that 
64-byte frames are not efficient. On the other hand, with 1024-byte frames and an asymptotic 
value of e 64-byte slots per contention interval, the contention period is 174 bytes long and 
the efficiency is 0.85. 
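

The figures quoted above can be reproduced with a few lines of code. This is only a sketch under 
the assumptions of this section (optimal p = 1/k, a 64-byte slot), with the frame time and the 
slot time both measured in byte times so that the data rate cancels out. The names are 
illustrative. 

```python
import math

SLOT_BYTES = 64  # 2*tau = 51.2 usec at 10 Mbps is 512 bits, or 64 bytes


def acquisition_probability(k: int) -> float:
    """A for k ready stations, each transmitting with the optimal p = 1/k."""
    p = 1.0 / k
    return k * p * (1 - p) ** (k - 1)


def efficiency(frame_bytes: int, k: int) -> float:
    """Channel efficiency P / (P + 2*tau/A), in units of byte times."""
    contention_bytes = SLOT_BYTES / acquisition_probability(k)
    return frame_bytes / (frame_bytes + contention_bytes)


def asymptotic_efficiency(frame_bytes: int) -> float:
    """Limit for many stations: e slots of contention per frame."""
    return frame_bytes / (frame_bytes + math.e * SLOT_BYTES)


print(round(math.e * SLOT_BYTES))               # contention period in bytes
print(round(asymptotic_efficiency(1024), 2))    # 1024-byte frames
print(round(asymptotic_efficiency(64), 2))      # 64-byte frames
```

Calling efficiency(frame_bytes, k) for a range of k values gives the points of one curve in a 
plot such as Fig. 4-19. 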


Figure 4-19. Efficiency of Ethernet at 10 Mbps with 512-bit slot times. 


To determine the mean number of stations ready to transmit under conditions of high load, we 
can use the following (crude) observation. Each frame ties up the channel for one contention 
period and one frame transmission time, for a total of P + w sec. The number of frames per 
second is therefore 1/(P + w). If each station generates frames at a mean rate of λ 
frames/sec, then when the system is in state k, the total input rate of all unblocked stations 
combined is kλ frames/sec. Since in equilibrium the input and output rates must be identical, 
we can equate these two expressions and solve for k. (Notice that w is a function of k.) A more 
sophisticated analysis is given in (Bertsekas and Gallager, 1992). 
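

Because w depends on k, the equilibrium condition has no simple closed form, but it is easy to 
solve numerically. The sketch below is one possible approach, not taken from the references: it 
treats k as a real number, assumes the optimal p = 1/k when computing w, and finds the root by 
bisection. The function names and the example numbers are our own. 

```python
def contention_interval(k: float, slot: float) -> float:
    """w = 2*tau / A for k >= 1 ready stations, with the optimal p = 1/k."""
    a = (1 - 1 / k) ** (k - 1)
    return slot / a


def ready_stations(rate: float, frame_time: float, slot: float,
                   k_max: float = 1e6) -> float:
    """Solve k * rate = 1 / (P + w(k)) for k by bisection on [1, k_max]."""
    def excess(k: float) -> float:
        # Positive when input (k * rate) exceeds output 1 / (P + w).
        return k * rate * (frame_time + contention_interval(k, slot)) - 1

    lo, hi = 1.0, k_max
    if excess(lo) > 0 or excess(hi) < 0:
        raise ValueError("no solution with 1 <= k <= k_max")
    for _ in range(200):
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Example: 1024-byte frames at 10 Mbps (P = 819.2 usec), 2*tau = 51.2 usec,
# and each station offering 100 frames/sec.
print(ready_stations(rate=100.0, frame_time=819.2e-6, slot=51.2e-6))
```

Bisection works here because the left-hand side grows with k: both the input rate and the 
contention interval increase as more stations become ready. 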


It is probably worth mentioning that there has been a large amount of theoretical performance 
analysis of Ethernet (and other networks). Virtually all of this work has assumed that traffic is 
Poisson. As researchers have begun looking at real data, it now appears that network traffic is 
rarely Poisson, but self-similar (Paxson and Floyd, 1994; and Willinger et al., 1995). What this 
means is that averaging over long periods of time does not smooth out the traffic. The average 
number of frames in each minute of an hour has as much variance as the average number of 
frames in each second of a minute. The consequence of this discovery is that most models of 
network traffic do not apply to the real world and should be taken with a grain (or better yet, a 
metric ton) of salt. 


4.2.6 Switched Ethernet 


As more and more stations are added to an Ethernet, the traffic will go up. Eventually, the LAN 
will saturate. One way out is to go to a higher speed, say, from 10 Mbps to 100 Mbps. But with 
the growth of multimedia, even a 100-Mbps or 1-Gbps Ethernet can become saturated. 


Fortunately, there is an additional way to deal with increased load: switched Ethernet, as 
shown in Fig. 4-20. The heart of this system is a switch containing a high-speed backplane 
and room for typically 4 to 32 plug-in line cards, each containing one to eight connectors. Most 
often, each connector has a 10Base-T twisted pair connection to a single host computer. 


Figure 4-20. A simple example of switched Ethernet. 


When a station wants to transmit an Ethernet frame, it outputs a standard frame to the switch. 
The plug-in card getting the frame may check to see if it is destined for one of the other 
stations connected to the same card. If so, the frame is copied there. If not, the frame is sent 
over the high-speed backplane to the destination station's card. The backplane typically runs 
at many Gbps, using a proprietary protocol. 


What happens if two machines attached to the same plug-in card transmit frames at the same 
time? It depends on how the card has been constructed. One possibility is for all the ports on 
the card to be wired together to form a local on-card LAN. Collisions on this on-card LAN will 
be detected and handled the same as any other collisions on a CSMA/CD network—with 
retransmissions using the binary exponential backoff algorithm. With this kind of plug-in card, 
only one transmission per card is possible at any instant, but all the cards can be transmitting 
in parallel. With this design, each card forms its own collision domain, independent of the 
others. With only one station per collision domain, collisions are impossible and performance is 
improved. 


With the other kind of plug-in card, each input port is buffered, so incoming frames are stored 
in the card's on-board RAM as they arrive. This design allows all input ports to receive (and 
transmit) frames at the same time, for parallel, full-duplex operation, something not possible 
with CSMA/CD on a single channel. Once a frame has been completely received, the card can 
then check to see if the frame is destined for another port on the same card or for a distant 
port. In the former case, it can be transmitted directly to the destination. In the latter case, it 
must be transmitted over the backplane to the proper card. With this design, each port is a 
separate collision domain, so collisions do not occur. The total system throughput can often be 
increased by an order of magnitude over 10Base5, which has a single collision domain for the 
entire system. 
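

The decision a buffered line card makes for each fully received frame can be sketched as 
follows. This is a toy model, not a description of any real switch: the LineCard class, its 
fields, and the directory that maps station addresses to cards are all hypothetical, and how 
the card learns these mappings is not addressed here. 

```python
from dataclasses import dataclass, field


@dataclass
class LineCard:
    """Hypothetical buffered plug-in card that knows which stations are on its ports."""
    card_id: int
    local_ports: dict[str, int] = field(default_factory=dict)  # address -> port

    def forward(self, dst: str, directory: dict[str, int]) -> str:
        """Decide where a completely received frame goes next."""
        if dst in self.local_ports:
            # Destination is on the same card: deliver it directly.
            return f"port {self.local_ports[dst]} on this card"
        if dst in directory:
            # Destination is on another card: send it over the backplane.
            return f"backplane to card {directory[dst]}"
        return "unknown destination"  # handling of this case is outside the sketch


card = LineCard(card_id=0, local_ports={"A": 1, "B": 2})
directory = {"C": 3}  # station C is attached to card 3
print(card.forward("B", directory))
print(card.forward("C", directory))
```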


Since the switch just expects standard Ethernet frames on each input port, it is possible to use 
some of the ports as concentrators. In Fig. 4-20, the port in the upper-right corner is 
connected not to a single station, but to a 12-port hub. As frames arrive at the hub, they 
contend for the ether in the usual way, including collisions and binary backoff. Successful 
frames make it to the switch and are treated there like any other incoming frames: they are 
switched to the correct output line over the high-speed backplane. Hubs are cheaper than 
switches, but due to falling switch prices, they are rapidly becoming obsolete. Nevertheless, 
legacy hubs still exist. 


4.2.7 Fast Ethernet 


At first, 10 Mbps seemed like heaven, just as 1200-bps modems seemed like heaven to the 
early users of 300-bps acoustic modems. But the novelty wore off quickly. As a kind of 
corollary to Parkinson's Law ("Work expands to fill the time available for its completion"), it 
seemed that data expanded to fill the bandwidth available for their transmission. To pump up 
the speed, various industry groups proposed two new ring-based optical LANs. One was called 
FDDI (Fiber Distributed Data Interface) and the other was called Fibre Channel [†]. To 
make a long story short, while both were used as backbone networks, neither one made the 
breakthrough to the desktop. In both cases, the station management was too complicated, 
which led to complex chips and high prices. The lesson that should have been learned here 
was KISS (Keep It Simple, Stupid). 


[†] It is called "fibre channel" and not "fiber channel" because the document editor was British. 


In any event, the failure of the optical LANs to catch fire left a gap for garden-variety Ethernet 
at speeds above 10 Mbps. Many installations needed more bandwidth and thus had numerous 
10-Mbps LANs connected by a maze of repeaters, bridges, routers, and gateways, although to 
the network managers it sometimes felt that they were being held together by bubble gum and 
chicken wire. 


It was in this environment that IEEE reconvened the 802.3 committee in 1992 with instructions 
to come up with a faster LAN. One proposal was to keep 802.3 exactly as it was, but just make 
it go faster. Another proposal was to redo it totally to give it lots of new features, such as 
real-time traffic and digitized voice, but just keep the old name (for marketing reasons). After 
some wrangling, the committee decided to keep 802.3 the way it was, but just make it go faster. 
The people behind the losing proposal did what any computer-industry people would have 
done under these circumstances: they stomped off and formed their own committee and 
standardized their LAN anyway (eventually as 802.12). It flopped miserably. 


The 802.3 committee decided to go with a souped-up Ethernet for three primary reasons: 


1. The need to be backward compatible with existing Ethernet LANs. 
2. The fear that a new protocol might have unforeseen problems. 
3. The desire to get the job done before the technology changed. 


The work was done quickly (by standards committees' norms), and the result, 802.3u, was 
officially approved by IEEE in June 1995. Technically, 802.3u is not a new standard, but an 
addendum to the existing 802.3 standard (to emphasize its backward compatibility). Since 
practically everyone calls it fast Ethernet, rather than 802.3u, we will do that, too. 


The basic idea behind fast Ethernet was simple: keep all the old frame formats, interfaces, and 
procedural rules, but just reduce the bit time from 100 nsec to 10 nsec. Technically, it would 
have been possible to copy either 10Base-5 or 10Base-2 and still detect collisions on time by 
just reducing the maximum cable length by a factor of ten. However, the advantages of 
10Base-T wiring were so overwhelming that fast Ethernet is based entirely on this design. 
Thus, all fast Ethernet systems use hubs and switches; multidrop cables with vampire taps or 
BNC connectors are not permitted. 


Nevertheless, some choices still had to be made, the most important being which wire types to 
support. One contender was category 3 twisted pair. The argument for it was that practically 
every office in the Western world has at least four category 3 (or better) twisted pairs running 
from it to a telephone wiring closet within 100 meters. Sometimes two such cables exist. Thus, 
using category 3 twisted pair would make it possible to wire up desktop computers using fast 
Ethernet without having to rewire the building, an enormous advantage for many 
organizations. 


The main disadvantage of category 3 twisted pair is its inability to carry 200 megabaud signals 
(100 Mbps with Manchester encoding) 100 meters, the maximum computer-to-hub distance 
specified for 10Base-T (see Fig. 4-13). In contrast, category 5 twisted pair wiring can handle 
100 meters easily, and fiber can go much farther. The compromise chosen was to allow all 
three possibilities, as shown in Fig. 4-21, but to pep up the category 3 solution to give it the 
additional carrying capacity needed. 



Figure 4-21. The original fast Ethernet cabling. 


Name       | Cable        | Max. segment | Advantages 
100Base-T4 | Twisted pair | 100 m        | Uses category 3 UTP 
100Base-TX | Twisted pair | 100 m        | Full duplex at 100 Mbps (Cat 5 UTP) 
100Base-FX | Fiber optics | 2000 m       | Full duplex at 100 Mbps; long runs 


The category 3 UTP scheme, called 100Base-T4, uses a signaling speed of 25 MHz, only 25 
percent faster than standard Ethernet's 20 MHz (remember that Manchester encoding, as 
shown in Fig. 4-16, requires two clock periods for each of the 10 million bits each second). 
However, to achieve the necessary bandwidth, 100Base-T4 requires four twisted pairs. Since 
standard telephone wiring for decades has had four twisted pairs per cable, most offices are 
able to handle this. Of course, it means giving up your office telephone, but that is surely a 
small price to pay for faster e-mail. 


Of the four twisted pairs, one is always to the hub, one is always from the hub, and the other 
two are switchable to the current transmission direction. To get the necessary bandwidth, 
Manchester encoding is not used, but with modern clocks and such short distances, it is no 
longer needed. In addition, ternary signals are sent, so that during a single clock period the 
wire can contain a 0, a 1, or a 2. With three twisted pairs going in the forward direction and 
ternary signaling, any one of 27 possible symbols can be transmitted, making it possible to 
send 4 bits with some redundancy. Transmitting 4 bits in each of the 25 million clock cycles 
per second gives the necessary 100 Mbps. In addition, there is always a 33.3-Mbps reverse 
channel using the remaining twisted pair. This scheme, known as 8B/6T (8 bits map to 6 
trits), is not likely to win any prizes for elegance, but it works with the existing wiring plant. 
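

The arithmetic behind 100Base-T4 can be laid out explicitly. The sketch below simply restates 
the numbers given above; the constant names are ours, and reading the 33.3-Mbps reverse channel 
as one pair's share of the three-pair forward rate is our interpretation. 

```python
CLOCK_HZ = 25_000_000   # 100Base-T4 signaling rate
FORWARD_PAIRS = 3       # one fixed pair plus the two switchable pairs
LEVELS = 3              # ternary signaling: each wire carries a 0, 1, or 2

# Three ternary wires give 27 combinations per clock period, enough for
# 4 data bits (16 values) with 11 combinations to spare.
symbols_per_clock = LEVELS ** FORWARD_PAIRS
bits_per_clock = 4
spare_symbols = symbols_per_clock - 2 ** bits_per_clock

forward_rate = bits_per_clock * CLOCK_HZ        # bits per second
reverse_rate = forward_rate / FORWARD_PAIRS     # a single pair's share

# 8B/6T: six trits must be able to represent every possible 8-bit byte.
codes_6t = LEVELS ** 6

print(symbols_per_clock, spare_symbols, forward_rate, round(reverse_rate / 1e6, 1))
print(codes_6t, ">=", 2 ** 8)
```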


For category 5 wiring, the design, 100Base-TX, is simpler because the wires can handle clock 
rates of 125 MHz. Only two twisted pairs per station are used, one to the hub and one from it. 
Straight binary coding is not used; instead a scheme called 4B/5B is used. It is taken from FDDI 
and compatible with it. Every group of five clock periods, each containing one of two signal 
values, yields 32 combinations. Sixteen of these combinations are used to transmit the four-bit 
groups 0000, 0001, 0010, ..., 1111. Some of the remaining 16 are used for control purposes 
such as marking frame boundaries. The combinations used have been carefully chosen to 
provide enough transitions to maintain clock synchronization. The 100Base-TX system is full 
duplex; stations can transmit at 100 Mbps and receive at 100 Mbps at the same time. Often 
100Base-TX and 100Base-T4 are collectively referred to as 100Base-T. 
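

A 4B/5B encoder is little more than a table lookup. The sketch below uses the sixteen data code 
groups of the FDDI 4B/5B code; the control code groups are omitted, and sending the high-order 
nibble first is a choice made here for readability, not a statement about the order on the wire. 

```python
# Data code groups of 4B/5B, indexed by the value of the 4-bit group.
FOUR_B_FIVE_B = [
    "11110", "01001", "10100", "10101",   # 0000 .. 0011
    "01010", "01011", "01110", "01111",   # 0100 .. 0111
    "10010", "10011", "10110", "10111",   # 1000 .. 1011
    "11010", "11011", "11100", "11101",   # 1100 .. 1111
]


def encode_4b5b(data: bytes) -> str:
    """Encode each byte as two 5-bit code groups (high nibble first)."""
    groups = []
    for byte in data:
        groups.append(FOUR_B_FIVE_B[byte >> 4])
        groups.append(FOUR_B_FIVE_B[byte & 0x0F])
    return "".join(groups)


# 125 million code bits per second carry 4 data bits for every 5 sent.
data_rate = 125_000_000 * 4 // 5

print(encode_4b5b(b"\x00"))
print(data_rate)
```

Even an all-zero byte produces a bit stream with regular transitions, which is the property 
that lets the receiver keep its clock synchronized. 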


The last option, 100Base-FX, uses two strands of multimode fiber, one for each direction, so 
it, too, is full duplex with 100 Mbps in each direction. In addition, the distance between a 
station and the hub can be up to 2 km. 


In response to popular demand, in 1997 the 802 committee added a new cabling type, 
100Base-T2, allowing fast Ethernet to run over two pairs of existing category 3 wiring. 
However, a sophisticated digital signal processor is needed to handle the encoding scheme 
required, making this option fairly expensive. So far, it is rarely used due to its complexity, 
cost, and the fact that many office buildings have already been rewired with category 5 UTP. 


Two kinds of interconnection devices are possible with 100Base-T: hubs and switches, as 
shown in Fig. 4-20. In a hub, all the incoming lines (or at least all the lines arriving at one 
plug-in card) are logically connected, forming a single collision domain. All the standard rules, 
including the binary exponential backoff algorithm, apply, so the system works just like 
old-fashioned Ethernet. In particular, only one station at a time can be transmitting. In other 
words, hubs require half-duplex communication. 


In a switch, each incoming frame is buffered on a plug-in line card and passed over a 
high-speed backplane from the source card to the destination card if need be. The backplane has 
not been standardized, nor does it need to be, since it is entirely hidden deep inside the 
switch. If past experience is any guide, switch vendors will compete vigorously to produce ever 
faster backplanes in order to improve system throughput. Because 100Base-FX cables are too 
long for the normal Ethernet collision algorithm, they must be connected to switches, so each 
one is a collision domain unto itself. Hubs are not permitted with 100Base-FX. 


As a final note, virtually all switches can handle a mix of 10-Mbps and 100-Mbps stations, to 
make upgrading easier. As a site acquires more and more 100-Mbps workstations, all it has to 
do is buy the necessary number of new line cards and insert them into the switch. In fact, the 
standard itself provides a way for two stations to automatically negotiate the optimum speed 
(10 or 100 Mbps) and duplexity (half or full). Most fast Ethernet products use this feature to 
autoconfigure themselves. 


4.2.8 Gigabit Ethernet 


The ink was barely dry on the fast Ethernet standard when the 802 committee began working 
on a yet faster Ethernet (1995). It was quickly dubbed gigabit Ethernet and was ratified by 
IEEE in 1998 under the name 802.3z. This identifier suggests that gigabit Ethernet is going to 
be the end of the line unless somebody quickly invents a new letter after z. Below we will 
discuss some of the key features of gigabit Ethernet. More information can be found in 
(Seifert, 1998). 


The 802.3z committee's goals were essentially the same as the 802.3u committee's goals: 
make Ethernet go 10 times faster yet remain backward compatible with all existing Ethernet 
standards. In particular, gigabit Ethernet had to offer unacknowledged datagram service with 
both unicast and multicast, use the same 48-bit addressing scheme already in use, and 
maintain the same frame format, including the minimum and maximum frame sizes. The final 
standard met all these goals. 


All configurations of gigabit Ethernet are point-to-point rather than multidrop as in the original 
10 Mbps standard, now honored as classic Ethernet. In the simplest gigabit Ethernet 
configuration, illustrated in Fig. 4-22(a), two computers are directly connected to each other. 
The more common case, however, is having a switch or a hub connected to multiple computers 
and possibly additional switches or hubs, as shown in Fig. 4-22(b). In both configurations each 
individual Ethernet cable has exactly two devices on it, no more and no fewer. 


Figure 4-22. (a) A two-station Ethernet. (b) A multistation Ethernet. 


Gigabit Ethernet supports two different modes of operation: full-duplex mode and half-duplex 
mode. The "normal" mode is full-duplex mode, which allows traffic in both directions at the 
same time. This mode is used when there is a central switch connected to computers (or other 
switches) on the periphery. In this configuration, all lines are buffered so each computer and 
switch is free to send frames whenever it wants to. The sender does not have to sense the 
channel to see if anybody else is using it because contention is impossible. On the line between 
a computer and a switch, the computer is the only possible sender on that line to the switch 
and the transmission succeeds even if the switch is currently sending a frame to the computer 
(because the line is full duplex). Since no contention is possible, the CSMA/CD protocol is not 
used, so the maximum length of the cable is determined by signal strength issues rather than 
by how long it takes for a noise burst to propagate back to the sender in the worst case. 
Switches are free to mix and match speeds. Autoconfiguration is supported just as in fast 
Ethernet. 


The other mode of operation, half-duplex, is used when the computers are connected to a hub 
rather than a switch. A hub does not buffer incoming frames. Instead, it electrically connects 
all the lines internally, simulating the multidrop cable used in classic Ethernet. In this mode, 
collisions are possible, so the standard CSMA/CD protocol is required. Because a minimum 
(i.e., 64-byte) frame can now be transmitted 100 times faster than in classic Ethernet, the 
maximum distance is 100 times less, or 25 meters, to maintain the essential property that the 
sender is still transmitting when the noise burst gets back to it, even in the worst case. With a 
2500-meter-long cable, the sender of a 64-byte frame at 1 Gbps would be long done before 
the frame got even a tenth of the way to the other end, let alone to the end and back. 
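

The claim about the 2500-meter cable is easy to verify. The sketch below assumes a signal 
propagation speed of 2 x 10^8 m/sec in the cable, a typical value that is not stated in this 
section; the function names are ours. 

```python
PROPAGATION_M_PER_S = 2e8   # assumed signal speed in the cable


def transmit_time(frame_bytes: int, bit_rate: float) -> float:
    """Seconds needed to put a frame of the given size onto the wire."""
    return frame_bytes * 8 / bit_rate


def distance_covered(seconds: float) -> float:
    """Meters the leading edge of the signal travels in the given time."""
    return seconds * PROPAGATION_M_PER_S


t = transmit_time(64, 1e9)    # minimum frame at 1 Gbps
d = distance_covered(t)       # how far the first bit gets before the last is sent

print(f"{t * 1e9:.0f} nsec, {d:.1f} m, one tenth of the cable is {2500 / 10:.0f} m")
```

Under this assumption the sender finishes when the first bit has traveled only about 100 
meters, well short of a tenth of the cable. 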


The 802.3z committee considered a radius of 25 meters to be unacceptable and added two 
features to the standard to increase the radius. The first feature, called carrier extension, 
essentially tells the hardware to add its own padding after the normal frame to extend the 
frame to 512 bytes. Since this padding is added by the sending hardware and removed by the 
receiving hardware, the software is unaware of it, meaning that no changes are needed to 
existing software. Of course, using 512 bytes worth of bandwidth to transmit 46 bytes of user 
data (the payload of a 64-byte frame) has a line efficiency of 9%. 
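

The 9% figure, and how quickly it improves for larger frames, can be computed directly. This is 
a small sketch using only the numbers above; it takes the 18 bytes of overhead to be the 
difference between the 64-byte minimum frame and its 46-byte payload, and it ignores everything 
else on the line. 

```python
OVERHEAD_BYTES = 18        # 64-byte minimum frame minus its 46-byte payload
EXTENDED_MIN_BYTES = 512   # carrier extension pads every frame to at least this


def line_efficiency(payload_bytes: int) -> float:
    """Fraction of transmitted bytes that are user data under carrier extension."""
    frame_bytes = payload_bytes + OVERHEAD_BYTES
    return payload_bytes / max(frame_bytes, EXTENDED_MIN_BYTES)


print(f"{line_efficiency(46):.0%}")     # 9%
print(f"{line_efficiency(1500):.0%}")   # frames this large need no extension
```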


The second feature, called frame bursting, allows a sender to transmit a concatenated 
sequence of multiple frames in a single transmission. If the total burst is less than 512 bytes, 
the hardware pads it again. If enough frames are waiting for transmission, this scheme is 
highly efficient and preferred over carrier extension. These new features extend the radius of 
the network to 200 meters, which is probably enough for most offices. 


In all fairness, it is hard to imagine an organization going to the trouble of buying and 
installing gigabit Ethernet cards to get high performance and then connecting the computers 
with a hub to simulate classic Ethernet with all its collisions. While hubs are somewhat cheaper 
than switches, gigabit Ethernet interface cards are still relatively expensive. To then economize 
by buying a cheap hub and slash the performance of the new system is foolish. Still, backward 
compatibility is sacred in the computer industry, so the 802.3z committee was required to put 
it in. 


Gigabit Ethernet supports both copper and fiber cabling, as listed in Fig. 4-23. Signaling at or 
near 1 Gbps over fiber means that the light source has to be turned on and off in under 1 
nsec. LEDs simply cannot operate this fast, so lasers are required. Two wavelengths are 
permitted: 0.85 microns (Short) and 1.3 microns (Long). Lasers at 0.85 microns are cheaper 
but do not work on single-mode fiber. 


Figure 4-23. Gigabit Ethernet cabling. 


Name        | Cable          | Max. segment | Advantages 
1000Base-SX | Fiber optics   | 550 m        | Multimode fiber (50, 62.5 microns) 
1000Base-LX | Fiber optics   | 5000 m       | Single (10 µ) or multimode (50, 62.5 µ) 
1000Base-CX | 2 Pairs of STP | 25 m         | Shielded twisted pair 
1000Base-T  | 4 Pairs of UTP | 100 m        | Standard category 5 UTP 


Three fiber diameters are permitted: 10, 50, and 62.5 microns. The first is for single mode and 
the last two are for multimode. Not all six combinations are allowed, however, and the 
maximum distance depends on the combination used. The numbers given in Fig. 4-23 are for 
the best case. In particular, 5000 meters is only achievable with 1.3 micron lasers operating 
over 10 micron fiber in single mode, but this is the best choice for campus backbones and is 
expected to be popular, despite its being the most expensive choice. 


The 1000Base-CX option uses short shielded copper cables. Its problem is that it is competing 
with high-performance fiber from above and cheap UTP from below. It is unlikely to be used 
much, if at all. 


The last option is bundles of four category 5 UTP wires working together. Because so much of 
this wiring is already installed, it is likely to be the poor man's gigabit Ethernet. 


Gigabit Ethernet uses new encoding rules on the fibers. Manchester encoding at 1 Gbps would 
require a 2 Gbaud signal, which was considered too difficult and also too wasteful of 
bandwidth. Instead a new scheme, called 8B/10B, was chosen, based on fibre channel. Each 
8-bit byte is encoded on the fiber as 10 bits, hence the name 8B/10B. Since there are 1024 
possible output codewords for each input byte, some leeway was available in choosing which 
codewords to allow. The following two rules were used in making the choices: 


1. No codeword may have more than four identical bits in a row. 
2. No codeword may have more than six 0s or six 1s. 


These choices were made to keep enough transitions in the stream to make sure the receiver 
stays in sync with the sender and also to keep the number of 0s and 1s on the fiber as close to 
equal as possible. In addition, many input bytes have two possible codewords assigned to 
them. When the encoder has a choice of codewords, it always chooses the codeword that 
moves in the direction of equalizing the number of 0s and 1s transmitted so far. This emphasis 
on balancing 0s and 1s is needed to keep the DC component of the signal as low as possible to 
allow it to pass through transformers unmodified. While computer scientists are not fond of 
having the properties of transformers dictate their coding schemes, life is like that sometimes. 
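

The two rules are simple to state as a filter over all 1024 possible 10-bit codewords. The 
sketch below checks only those two rules as worded above; it does not reproduce the actual 
8B/10B code tables, and the function names are ours. 

```python
def longest_run(bits: str) -> int:
    """Length of the longest run of identical characters in a nonempty string."""
    best = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def satisfies_rules(codeword: int) -> bool:
    """Apply rule 1 (runs of at most four) and rule 2 (at most six 0s or six 1s)."""
    bits = format(codeword, "010b")
    ones = bits.count("1")
    zeros = len(bits) - ones
    return longest_run(bits) <= 4 and ones <= 6 and zeros <= 6


candidates = [c for c in range(1024) if satisfies_rules(c)]

print(satisfies_rules(0b0101010101))   # balanced, no long runs
print(satisfies_rules(0b0000011111))   # two runs of five
print(len(candidates) >= 256)          # enough codewords for every byte value
```

Since far more than 256 codewords pass both tests, there is room to give many bytes two 
codewords, which is what makes the balancing of 0s and 1s possible. 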


Gigabit Ethernets using 1000Base-T use a different encoding scheme since clocking data onto 
copper wire in 1 nsec is too difficult. This solution uses four category 5 twisted pairs to allow 
four symbols to be transmitted in parallel. Each symbol is encoded using one of five voltage 
levels. This scheme allows a single symbol to encode 00, 01, 10, 11, or a special value for 
control purposes. Thus, there are 2 data bits per twisted pair or 8 data bits per clock cycle. The 
clock runs at 125 MHz, allowing 1-Gbps operation. The reason for allowing five voltage levels 
instead of four is to have combinations left over for framing and control purposes. 
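

The 1000Base-T numbers fit together as follows. This sketch only restates the arithmetic of the 
paragraph above, with constant names of our own choosing. 

```python
PAIRS = 4                 # category 5 twisted pairs used in parallel
LEVELS = 5                # voltage levels per symbol
DATA_LEVELS = 4           # levels that carry 00, 01, 10, or 11
CLOCK_HZ = 125_000_000    # symbol clock

bits_per_pair = (DATA_LEVELS - 1).bit_length()   # 2 data bits per symbol
bits_per_clock = PAIRS * bits_per_pair           # 8 data bits per clock cycle
data_rate = bits_per_clock * CLOCK_HZ            # bits per second

# Combinations across the four pairs: all of them versus those needed for data.
all_combinations = LEVELS ** PAIRS
data_combinations = DATA_LEVELS ** PAIRS
spare = all_combinations - data_combinations     # left over for framing and control

print(bits_per_clock, data_rate, all_combinations, data_combinations, spare)
```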


A speed of 1 Gbps is quite fast. For example, if a receiver is busy with some other task for 
even 1 msec and does not empty the input buffer on some line, up to 1953 frames may have 
accumulated there in that 1-msec gap. Also, when a computer on a gigabit Ethernet is shipping 
data down the line to a computer on a classic Ethernet, buffer overruns are very likely. As a 
consequence of these two observations, gigabit Ethernet supports flow control (as does fast 
Ethernet, although the two are different). 
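

The figure of 1953 frames follows from counting minimum-size frames back to back. A short 
sketch, assuming 64-byte frames and nothing else on the line (the function name is ours): 

```python
def frames_accumulated(bit_rate: int, gap_msec: int, frame_bytes: int = 64) -> int:
    """Whole frames that can arrive while the receiver is busy for gap_msec."""
    bits = bit_rate * gap_msec // 1000
    return bits // (frame_bytes * 8)


print(frames_accumulated(1_000_000_000, 1))   # 1953
```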


The flow control consists of one end sending a special control frame to the other end telling it 
to pause for some period of time. Control frames are normal Ethernet frames containing a type 
of 0x8808. The first two bytes of the data field give the command; succeeding bytes provide 
the parameters, if any. For flow control, PAUSE frames are used, with the parameter telling 
how long to pause, in units of the minimum frame time. For gigabit Ethernet, the time unit is 
512 nsec, allowing for pauses as long as 33.6 msec. 
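

The layout of a PAUSE frame can be sketched with a few lines of packing code. The type value 
0x8808 and the 512-nsec unit come from the text above. The destination address and the command 
code 0x0001 are not given in this section; they are the values we understand the 802.3 flow 
control specification to use, and should be checked against the standard. The function names 
are our own, and the checksum is left out. 

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # assumed: reserved multicast address
MAC_CONTROL_TYPE = 0x8808                  # type of control frames (from the text)
PAUSE_COMMAND = 0x0001                     # assumed: command code for PAUSE
MIN_FRAME_NO_FCS = 60                      # 64-byte minimum minus 4-byte checksum


def build_pause_frame(src: bytes, quanta: int) -> bytes:
    """Build a PAUSE frame (without checksum) asking for `quanta` time units."""
    if len(src) != 6:
        raise ValueError("source address must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("pause time must fit in 16 bits")
    header = PAUSE_DST + src + struct.pack("!H", MAC_CONTROL_TYPE)
    # First two data bytes give the command; the parameter follows.
    body = struct.pack("!HH", PAUSE_COMMAND, quanta)
    return (header + body).ljust(MIN_FRAME_NO_FCS, b"\x00")


def pause_msec(quanta: int, unit_nsec: int = 512) -> float:
    """Length of the requested pause in milliseconds."""
    return quanta * unit_nsec / 1e6


frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
print(len(frame), frame[12:18].hex())
print(round(pause_msec(0xFFFF), 1))   # longest pause at gigabit speed: 33.6
```

The longest pause corresponds to the largest 16-bit parameter, 65,535 units of 512 nsec each. 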


As soon as gigabit Ethernet was standardized, the 802 committee got bored and wanted to get 
back to work. IEEE told them to start on 10-gigabit Ethernet. After searching hard for a letter 
to follow z, they abandoned that approach and went over to two-letter suffixes. They got to 
work and that standard was approved by IEEE in 2002 as 802.3ae. Can 100-gigabit Ethernet 
be far behind? 


4.2.9 IEEE 802.2: Logical Link Control 


It is now perhaps time to step back and compare what we have learned in this chapter with 
what we studied in the previous one. In Chap. 3, we saw how two machines could 
communicate reliably over an unreliable line by using various data link protocols. These 
protocols provided error control (using acknowledgements) and flow control (using a sliding 
window). 


In contrast, in this chapter, we have not said a word about reliable communication. All that 
Ethernet and the other 802 protocols offer is a best-efforts datagram service. Sometimes, this 
service is adequate. For example, for transporting IP packets, no guarantees are required or 
even expected. An IP packet can just be inserted into an 802 payload field and sent on its way. 
If it gets lost, so be it. 


Nevertheless, there are also systems in which an error-controlled, flow-controlled data link 
protocol is desired. IEEE has defined one that can run on top of Ethernet and the other 802 
protocols. In addition, this protocol, called LLC (Logical Link Control), hides the differences 
between the various kinds of 802 networks by providing a single format and interface to the 
network layer. This format, interface, and protocol are all closely based on the HDLC protocol 
we studied in Chap. 3. LLC forms the upper half of the data link layer, with the MAC sublayer 
below it, as shown in Fig. 4-24. 


Figure 4-24. (a) Position of LLC. (b) Protocol formats. 


Typical usage of LLC is as follows. The network layer on the sending machine passes a packet 
to LLC, using the LLC access primitives. The LLC sublayer then adds an LLC header, containing 
sequence and acknowledgement numbers. The resulting structure is then inserted into the 
payload field of an 802 frame and transmitted. At the receiver, the reverse process takes 
place. 


LLC provides three service options: unreliable datagram service, acknowledged datagram 
service, and reliable connection-oriented service. The LLC header contains three fields: a 
destination access point, a source access point, and a control field. The access points tell which 
process the frame came from and where it is to be delivered, replacing the DIX Type field. The 
control field contains sequence and acknowledgement numbers, very much in the style of 
HDLC (see Fig. 3-24), but not identical to it. These fields are primarily used when a reliable 
connection is needed at the data link level, in which case protocols similar to the ones 
discussed in Chap. 3 would be used. For the Internet, best-efforts attempts to deliver IP 
packets are sufficient, so no acknowledgements at the LLC level are required. 
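

The encapsulation step can be illustrated with the three header fields just described. The 
sketch below is a simplification: it assumes a one-byte control field, as would suffice for the 
unreliable datagram service, whereas control fields that carry sequence and acknowledgement 
numbers need more room. The access point and control values in the example are chosen for 
illustration only, and the function names are ours. 

```python
import struct


def add_llc_header(dsap: int, ssap: int, control: int, packet: bytes) -> bytes:
    """Prefix a network layer packet with destination/source access points and control."""
    return struct.pack("!BBB", dsap, ssap, control) + packet


def strip_llc_header(pdu: bytes) -> tuple[int, int, int, bytes]:
    """Reverse process at the receiver: split the header fields from the packet."""
    dsap, ssap, control = struct.unpack("!BBB", pdu[:3])
    return dsap, ssap, control, pdu[3:]


# Illustrative values only; the result would go into the payload field of an 802 frame.
pdu = add_llc_header(dsap=0x06, ssap=0x06, control=0x03, packet=b"network layer packet")
print(strip_llc_header(pdu))
```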


4.2.10 Retrospective on Ethernet 


Ethernet has been around for over 20 years and has no serious competitors in sight, so it is 
likely to be around for many years to come. Few CPU architectures, operating systems, or 
programming languages have been king of the mountain for two decades going on three. 
Clearly, Ethernet did something right. What? 




Probably the main reason for its longevity is that Ethernet is simple and flexible. In practice, 
simple translates into reliable, cheap, and easy to maintain. Once the vampire taps were 
replaced by BNC connectors, failures became extremely rare. People hesitate to replace 
something that works perfectly all the time, especially when they know that an awful lot of 
things in the computer industry work very poorly, so that many so-called "upgrades" are 
appreciably worse than what they replaced. 


Simple also translates into cheap. Thin Ethernet and twisted pair wiring are relatively 
inexpensive. The interface cards are also low cost. Only when hubs and switches were 
introduced were substantial investments required, but by the time they were in the picture, 
Ethernet was already well established. 


Ethernet is easy to maintain. There is no software to install (other than the drivers) and there 
are no configuration tables to manage (and get wrong). Also, adding new hosts is as simple as 
just plugging them in. 


Another point is that Ethernet interworks easily with TCP/IP, which has become dominant. IP is 
a connectionless protocol, so it fits perfectly with Ethernet, which is also connectionless. IP fits 
much less well with ATM, which is connection oriented. This mismatch definitely hurt ATM's 
chances. 


Lastly, Ethernet has been able to evolve in certain crucial ways. Speeds have gone up by 
several orders of magnitude and hubs and switches have been introduced, but these changes 
have not required changing the software. When a network salesman shows up at a large 
installation and says: "I have this fantastic new network for you. All you have to do is throw 
out all your hardware and rewrite all your software," he has a problem. FDDI, Fibre Channel, 
and ATM were all faster than Ethernet when introduced, but they were incompatible with 
Ethernet, far more complex, and harder to manage. Eventually, Ethernet caught up with them 
in terms of speed, so they had no advantages left and quietly died off except for ATM's use 
deep within the core of the telephone system. 


4.3 Wireless LANs 


Although Ethernet is widely used, it is about to get some competition. Wireless LANs are 
increasingly popular, and more and more office buildings, airports, and other public places are 
being outfitted with them. Wireless LANs can operate in one of two configurations, as we saw 
in Fig. 1-35: with a base station and without a base station. Consequently, the 802.11 LAN 
standard takes this into account and makes provision for both arrangements, as we will see 
shortly. 


We gave some background information on 802.11 in Sec. 1.5.4. Now is the time to take a 
closer look at the technology. In the following sections we will look at the protocol stack, 
physical layer radio transmission techniques, MAC sublayer protocol, frame structure, and 
services. For more information about 802.11, see (Crow et al., 1997; Geier, 2002; Heegard et 
al., 2001; Kapp, 2002; O'Hara and Petrick, 1999; and Severance, 1999). To hear the truth 
from the mouth of the horse, consult the published 802.11 standard itself. 


4.3.1 The 802.11 Protocol Stack 


The protocols used by all the 802 variants, including Ethernet, have a certain commonality of 
structure. A partial view of the 802.11 protocol stack is given in Fig. 4-25. The physical layer 
corresponds to the OSI physical layer fairly well, but the data link layer in all the 802 protocols 
is split into two or more sublayers. In 802.11, the MAC (Medium Access Control) sublayer 
determines how the channel is allocated, that is, who gets to transmit next. Above it is the LLC 
(Logical Link Control) sublayer, whose job it is to hide the differences between the different 
802 variants and make them indistinguishable as far as the network layer is concerned. We 
studied the LLC when examining Ethernet earlier in this chapter and will not repeat that 
material here. 


Figure 4-25. Part of the 802.11 protocol stack. 


The 1997 802.11 standard specifies three transmission techniques allowed in the physical 
layer. The infrared method uses much the same technology as television remote controls do. 
The other two use short-range radio, using techniques called FHSS and DSSS. Both of these 
use a part of the spectrum that does not require licensing (the 2.4-GHz ISM band). Radio- 
controlled garage door openers also use this piece of the spectrum, so your notebook 
computer may find itself in competition with your garage door. Cordless telephones and 
microwave ovens also use this band. All of these techniques operate at 1 or 2 Mbps and at low 
enough power that they do not conflict too much. In 1999, two new techniques were 
introduced to achieve higher bandwidth. These are called OFDM and HR-DSSS. They operate at 
up to 54 Mbps and 11 Mbps, respectively. In 2001, a second OFDM modulation was introduced, 
but in a different frequency band from the first one. Now we will examine each of them briefly. 
Technically, these belong to the physical layer and should have been examined in Chapter 2, 
but since they are so closely tied to LANs in general and the 802.11 MAC sublayer, we treat 
them here instead. 


4.3.2 The 802.11 Physical Layer 


Each of the five permitted transmission techniques makes it possible to send a MAC frame 
from one station to another. They differ, however, in the technology used and speeds 
achievable. A detailed discussion of these technologies is far beyond the scope of this book, 
but a few words on each one, along with some of the key words, may provide interested 
readers with terms to search for on the Internet or elsewhere for more information. 


The infrared option uses diffused (i.e., not line of sight) transmission at 0.85 or 0.95 microns. 
Two speeds are permitted: 1 Mbps and 2 Mbps. At 1 Mbps, an encoding scheme is used in 
which a group of 4 bits is encoded as a 16-bit codeword containing fifteen 0s and a single 1, 
using what is called Gray code. This code has the property that a small error in time 
synchronization leads to only a single bit error in the output. At 2 Mbps, the encoding takes 2 
bits and produces a 4-bit codeword, also with only a single 1, that is one of 0001, 0010, 0100, 
or 1000. Infrared signals cannot penetrate walls, so cells in different rooms are well isolated 
from each other. Nevertheless, due to the low bandwidth (and the fact that sunlight swamps 
infrared signals), this is not a popular option. 
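

The two infrared encodings can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the particular assignment of bit groups to pulse positions (slot k carries the Gray code of k) is an assumption, not a mapping copied from the standard. 

```python
def gray(value):
    """Binary-reflected Gray code of value."""
    return value ^ (value >> 1)


def ppm_encode(value, bits):
    """Encode a group of `bits` data bits as a codeword of 2**bits slots with a single 1.

    Assumed mapping: the pulse goes in the slot whose Gray code equals the data
    value, so neighbouring slots stand for values that differ in one bit.
    """
    slots = 1 << bits
    if not 0 <= value < slots:
        raise ValueError("value does not fit in the given number of bits")
    position = next(p for p in range(slots) if gray(p) == value)
    # Slot 0 is the rightmost character of the codeword.
    return format(1 << position, "0{}b".format(slots))


def ppm_decode(codeword):
    """Recover the data value from a codeword containing exactly one 1."""
    position = len(codeword) - 1 - codeword.index("1")
    return gray(position)


# 2 Mbps: 2 data bits per 4-slot codeword; 1 Mbps: 4 data bits per 16-slot codeword.
codewords_2mbps = [ppm_encode(v, 2) for v in range(4)]
codewords_1mbps = [ppm_encode(v, 4) for v in range(16)]
```

With this mapping, a pulse that is detected one slot too early or too late decodes to a value that differs from the intended one in exactly one bit, which is the property the text attributes to the Gray code. 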


FHSS (Frequency Hopping Spread Spectrum) uses 79 channels, each 1-MHz wide, starting 
at the low end of the 2.4-GHz ISM band. A pseudorandom number generator is used to 
produce the sequence of frequencies hopped to. As long as all stations use the same seed to 
the pseudorandom number generator and stay synchronized in time, they will hop to the same 
frequencies simultaneously. The amount of time spent at each frequency, the dwell time, is 
an adjustable parameter, but must be less than 400 msec. FHSS' randomization provides a fair 
way to allocate spectrum in the unregulated ISM band. It also provides a modicum of security 
since an intruder who does not know the hopping sequence or dwell time cannot eavesdrop on 
transmissions. Over longer distances, multipath fading can be an issue, and FHSS offers good 
resistance to it. It is also relatively insensitive to radio interference, which makes it popular for 
building-to-building links. Its main disadvantage is its low bandwidth. 
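

The role of the shared seed can be illustrated with a short Python sketch. The generator used here is Python's own, standing in for whatever generator the standard prescribes, and the 2402-MHz starting frequency is an assumption made for the example. 

```python
import random

NUM_CHANNELS = 79    # 1-MHz channels, as described in the text
BASE_MHZ = 2402      # assumed center frequency of the lowest channel


def hop_sequence(seed, hops):
    """Return the first `hops` center frequencies (in MHz) produced by a given seed."""
    rng = random.Random(seed)    # stand-in generator, not the one in the standard
    return [BASE_MHZ + rng.randrange(NUM_CHANNELS) for _ in range(hops)]


station_a = hop_sequence(seed=1234, hops=8)
station_b = hop_sequence(seed=1234, hops=8)    # same seed: hops in lockstep with A
outsider = hop_sequence(seed=9999, hops=8)     # different seed: its own sequence
```

Two stations that share the seed compute identical sequences and so meet on the same channel at every hop, provided they also agree on when each dwell period starts. 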


The third modulation method, DSSS (Direct Sequence Spread Spectrum), is also restricted 
to 1 or 2 Mbps. The scheme used has some similarities to the CDMA system we examined in 
Sec. 2.6.2, but differs in other ways. Each bit is transmitted as 11 chips, using what is called a 
Barker sequence. It uses phase shift modulation at 1 Mbaud, transmitting 1 bit per baud 
when operating at 1 Mbps and 2 bits per baud when operating at 2 Mbps. For years, the FCC 
required all wireless communications equipment operating in the ISM bands in the U.S. to use 
spread spectrum, but in May 2002, that rule was dropped as new technologies emerged. 
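

To make the idea of sending each bit as 11 chips concrete, here is a minimal Python sketch of spreading and despreading. The chip ordering shown is the one commonly quoted for 802.11, and the mapping of a 1 bit to the sequence and a 0 bit to its negation is a simplifying assumption (the real system applies differential phase shift modulation on top). 

```python
# 11-chip Barker sequence (ordering as commonly quoted for 802.11).
BARKER = [+1, -1, +1, +1, -1, +1, +1, +1, -1, -1, -1]


def spread(bits):
    """Replace each data bit by 11 chips: the Barker sequence for 1, its negation for 0."""
    chips = []
    for bit in bits:
        sign = 1 if bit else -1
        chips.extend(sign * c for c in BARKER)
    return chips


def despread(chips):
    """Correlate each block of 11 chips with the Barker sequence and take the sign."""
    bits = []
    for start in range(0, len(chips), len(BARKER)):
        block = chips[start:start + len(BARKER)]
        correlation = sum(c * b for c, b in zip(block, BARKER))
        bits.append(1 if correlation > 0 else 0)
    return bits


data = [1, 0, 1, 1, 0]
chips = spread(data)
```

Because a clean block correlates to +11 or -11, a few corrupted chips per bit still leave the sign of the correlation intact, which is where the resistance to interference comes from. 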


The first of the high-speed wireless LANs, 802.11a, uses OFDM (Orthogonal Frequency 
Division Multiplexing) to deliver up to 54 Mbps in the wider 5-GHz ISM band. As the term 
FDM suggests, different frequencies are used—52 of them, 48 for data and 4 for 
synchronization—not unlike ADSL. Since transmissions are present on multiple frequencies at 
the same time, this technique is considered a form of spread spectrum, but different from both 
CDMA and FHSS. Splitting the signal into many narrow bands has some key advantages over 
using a single wide band, including better immunity to narrowband interference and the 
possibility of using noncontiguous bands. A complex encoding system is used, based on phase- 
shift modulation for speeds up to 18 Mbps and on QAM above that. At 54 Mbps, 216 data bits 
are encoded into 288-bit symbols. Part of the motivation for OFDM is compatibility with the 
European HiperLAN/2 system (Doufexi et al., 2002). The technique has a good spectrum 
efficiency in terms of bits/Hz and good immunity to multipath fading. 
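

The figures quoted for the 54-Mbps mode fit together as the following arithmetic shows. The 48 data subcarriers and the 216 data bits per symbol come from the text; the 6 coded bits per subcarrier (64-QAM) and the 4-microsecond symbol time are values we supply as assumptions to complete the calculation. 

```python
DATA_SUBCARRIERS = 48             # from the text
CODED_BITS_PER_SUBCARRIER = 6     # assumption: 64-QAM at the top rate
SYMBOL_TIME_US = 4                # assumption: OFDM symbol duration in microseconds
DATA_BITS_PER_SYMBOL = 216        # from the text

coded_bits_per_symbol = DATA_SUBCARRIERS * CODED_BITS_PER_SUBCARRIER
code_rate = DATA_BITS_PER_SYMBOL / coded_bits_per_symbol
data_rate_mbps = DATA_BITS_PER_SYMBOL / SYMBOL_TIME_US    # bits per microsecond = Mbps

print(coded_bits_per_symbol, code_rate, data_rate_mbps)
```

Under these assumptions each symbol carries 288 coded bits, three quarters of which are data, and the data rate works out to 54 Mbps. 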


Next, we come to HR-DSSS (High Rate Direct Sequence Spread Spectrum), another 
spread spectrum technique, which uses 11 million chips/sec to achieve 11 Mbps in the 2.4-GHz 
band. It is called 802.11b but is not a follow-up to 802.11a. In fact, its standard was 
approved first and it got to market first. Data rates supported by 802.11b are 1, 2, 5.5, and 11 
Mbps. The two slow rates run at 1 Mbaud, with 1 and 2 bits per baud, respectively, using 
phase shift modulation (for compatibility with DSSS). The two faster rates run at 1.375 Mbaud, 
with 4 and 8 bits per baud, respectively, using Walsh/Hadamard codes. The data rate may 
be dynamically adapted during operation to achieve the optimum speed possible under current 
conditions of load and noise. In practice, the operating speed of 802.11b is nearly always 11 
Mbps. Although 802.11b is slower than 802.11a, its range is about 7 times greater, which is 
more important in many situations. 
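

The four 802.11b data rates follow directly from the symbol rates and bits per baud given above, as this small calculation shows. The variable names are ours. 

```python
# (symbol rate in Mbaud, bits per baud) for each 802.11b mode, as given in the text
MODES = [(1.0, 1), (1.0, 2), (1.375, 4), (1.375, 8)]

rates_mbps = [baud * bits for baud, bits in MODES]

# At 11 million chips/sec, the two faster modes spend this many chips on each symbol.
chips_per_symbol_fast = 11.0 / 1.375
```

The list evaluates to 1, 2, 5.5, and 11 Mbps, and the faster modes use 8 chips per symbol. 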


An enhanced version of 802.11b, 802.11g, was approved by IEEE in November 2001 after 
much politicking about whose patented technology it would use. It uses the OFDM modulation 
method of 802.11a but operates in the narrow 2.4-GHz ISM band along with 802.11b. In 
theory it can operate at up to 54 Mbps. It is not yet clear whether this speed will be realized in 
practice. What it does mean is that the 802.11 committee has produced three different 
high-speed wireless LANs: 802.11a, 802.11b, and 802.11g (not to mention three low-speed 
wireless LANs). One can legitimately ask if this is a good thing for a standards committee to 
do. Maybe three was their lucky number. 


4.3.3 The 802.11 MAC Sublayer Protocol 


Let us now return from the land of electrical engineering to the land of computer science. The 
802.11 MAC sublayer protocol is quite different from that of Ethernet due to the inherent 
complexity of the wireless environment compared to that of a wired system. With Ethernet, a 
station just waits until the ether goes silent and starts transmitting. If it does not receive a 
noise burst back within the first 64 bytes, the frame has almost assuredly been delivered 
correctly. With wireless, this situation does not hold. 


To start with, there is the hidden station problem mentioned earlier and illustrated again in Fig. 
4-26(a). Since not all stations are within radio range of each other, transmissions going on in 
one part of a cell may not be received elsewhere in the same cell. In this example, station C is 
transmitting to station B. If A senses the channel, it will not hear anything and falsely conclude 
that it may now start transmitting to B. 


Figure 4-26. (a) The hidden station problem. (b) The exposed station problem. 


In addition, there is the inverse problem, the exposed station problem, illustrated in 
Fig. 4-26(b). Here B wants to send to C so it listens to the channel. When it hears a transmission, it 
falsely concludes that it may not send to C, even though A may be transmitting to D (not 
shown). In addition, most radios are half duplex, meaning that they cannot transmit and listen 
for noise bursts at the same time on a single frequency. As a result of these problems, 802.11 
does not use CSMA/CD, as Ethernet does. 


To deal with this problem, 802.11 supports two modes of operation. The first, called DCF 
(Distributed Coordination Function), does not use any kind of central control (in that 
respect, similar to Ethernet). The other, called PCF (Point Coordination Function), uses the 
base station to control all activity in its cell. All implementations must support DCF but PCF is 
optional. We will now discuss these two modes in turn. 


When DCF is employed, 802.11 uses a protocol called CSMA/CA (CSMA with Collision 
Avoidance). In this protocol, both physical channel sensing and virtual channel sensing are 
used. Two methods of operation are supported by CSMA/CA. In the first method, when a 
station wants to transmit, it senses the channel. If it is idle, it just starts transmitting. It does 
not sense the channel while transmitting but emits its entire frame, which may well be 
destroyed at the receiver due to interference there. If the channel is busy, the sender defers 
until it goes idle and then starts transmitting. If a collision occurs, the colliding stations wait a 
random time, using the Ethernet binary exponential backoff algorithm, and then try again 
later. 
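

The first method can be summarized in a short Python sketch. The function names are ours, and the cap of 10 on the exponent is Ethernet's value, used here as an assumption since the text does not say what limit 802.11 applies. 

```python
import random


def next_action(channel_idle):
    """Decision taken by a station with a frame to send (first CSMA/CA method)."""
    return "transmit whole frame" if channel_idle else "defer until idle"


def backoff_slots(collisions, rng, cap=10):
    """Pick a random wait, in slot times, after the given number of collisions.

    After i collisions the wait is drawn uniformly from 0 .. 2**i - 1 slots,
    with the exponent capped (10 is Ethernet's value, assumed here).
    """
    exponent = min(collisions, cap)
    return rng.randrange(2 ** exponent)


rng = random.Random(42)    # fixed seed only to make the example repeatable
waits = [backoff_slots(i, rng) for i in range(1, 6)]
```

Each additional collision doubles the range from which the wait is drawn, so repeated collisions spread the competing stations out over a longer and longer interval. 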


The other mode of CSMA/CA operation is based on MACAW and uses virtual channel sensing, 
as illustrated in Fig. 4-27. In this example, A wants to send to B. C is a station within range of 
A (and possibly within range of B, but that does not matter). D is a station within range of B 
but not within range of A. 


Figure 4-27. The use of virtual channel sensing using CSMA/CA. 


The protocol starts when A decides it wants to send data to B. It begins by sending an RTS 
frame to B to request permission to send it a frame. When B receives this request, it may 
decide to grant permission, in which case it sends a CTS frame back. Upon receipt of the CTS, 
A now sends its frame and starts an ACK timer. Upon correct receipt of the data frame, B 
responds with an ACK frame, terminating the exchange. If A's ACK timer expires before the 
ACK gets back to it, the whole protocol is run again. 


Now let us consider this exchange from the viewpoints of C and D. C is within range of A, so it 
may receive the RTS frame. If it does, it realizes that someone is going to send data soon, so 
for the good of all it desists from transmitting anything until the exchange is completed. From 
the information provided in the RTS request, it can estimate how long the sequence will take, 
including the final ACK, so it asserts a kind of virtual channel busy for itself, indicated by NAV 
(Network Allocation Vector) in Fig. 4-27. D does not hear the RTS, but it does hear the 
CTS, so it also asserts the NAV signal for itself. Note that the NAV signals are not transmitted; 
they are just internal reminders to keep quiet for a certain period of time. 
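

The bookkeeping that C and D perform can be sketched as follows. The class and its method names are ours, and the timings in the example are hypothetical; the point is only that the NAV is a local timer that a station consults alongside physical sensing. 

```python
class Station:
    """Tracks the virtual carrier sense (NAV) of one station; times in microseconds."""

    def __init__(self, name):
        self.name = name
        self.nav_until = 0    # the channel is treated as busy until this time

    def overhear(self, now, duration):
        """Called on receiving an RTS or CTS that is addressed to another station."""
        self.nav_until = max(self.nav_until, now + duration)

    def may_transmit(self, now, physically_idle=True):
        """Both physical and virtual sensing must report an idle channel."""
        return physically_idle and now >= self.nav_until


# Hypothetical timings: the RTS at t = 0 announces 500 usec for the whole
# exchange; the CTS at t = 50 announces the remaining 450 usec.
c, d = Station("C"), Station("D")
c.overhear(now=0, duration=500)     # C is in range of A and hears the RTS
d.overhear(now=50, duration=450)    # D is in range of B and hears only the CTS
```

Both stations stay silent until t = 500 even though neither of them can hear the entire exchange. 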


In contrast to wired networks, wireless networks are noisy and unreliable, in no small part due 
to microwave ovens, which also use the unlicensed ISM bands. As a consequence, the 
probability of a frame making it through successfully decreases with frame length. If the 
probability of any bit being in error is p, then the probability of an n-bit frame being received 
entirely correctly is (1 - p)^n. For example, for p = 10^-4, the probability of receiving a full 
Ethernet frame (12,144 bits) correctly is less than 30%. If p = 10^-5, about one frame in 9 will 
be damaged. Even if p = 10^-6, over 1% of the frames will be damaged, which amounts to 
almost a dozen per second, and more if frames shorter than the maximum are used. In 
summary, if a frame is too long, it has very little chance of getting through undamaged and 
will probably have to be retransmitted. 
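

These figures are easy to check numerically. The sketch below assumes, as the discussion does, that bit errors are independent; the function name is ours. 

```python
def p_frame_ok(p_bit_error, n_bits):
    """Probability that all n bits arrive intact, assuming independent bit errors."""
    return (1.0 - p_bit_error) ** n_bits


FRAME_BITS = 12144    # full Ethernet frame, as in the text

for p in (1e-4, 1e-5, 1e-6):
    ok = p_frame_ok(p, FRAME_BITS)
    print("p = {:.0e}: P(frame ok) = {:.3f}, P(damaged) = {:.3f}".format(p, ok, 1 - ok))
```

The three success probabilities come out just under 0.30, near 0.89, and near 0.99, in line with the numbers quoted above. 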


To deal with the problem of noisy channels, 802.11 allows frames to be fragmented into 
smaller pieces, each with its own checksum. The fragments are individually numbered and 
acknowledged using a stop-and-wait protocol (i.e., the sender may not transmit fragment k + 
1 until it has received the acknowledgment for fragment k). Once the channel has been 
acquired using RTS and CTS, multiple fragments can be sent in a row, as shown in Fig. 4-28. 
This sequence of fragments is called a fragment burst. 


Figure 4-28. A fragment burst. 


Fragmentation increases the throughput by restricting retransmissions to the bad fragments 
rather than the entire frame. The fragment size is not fixed by the standard but is a parameter 
of each cell and can be adjusted by the base station. The NAV mechanism keeps other stations 
quiet only until the next acknowledgement, but another mechanism (described below) is used 
to allow a whole fragment burst to be sent without interference. 
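

A rough calculation shows how much fragmentation can save. The sketch assumes independent bit errors, retransmission of each damaged fragment until it gets through, and no header or acknowledgement overhead; the fragment size of 1518 bits is chosen only because it divides the frame evenly. 

```python
def expected_bits_sent(total_bits, fragment_bits, p_bit_error):
    """Expected number of bits transmitted until every fragment has arrived intact.

    A fragment is resent until it gets through, so on average it needs
    1 / P(fragment ok) transmissions. Header and ACK overhead is ignored.
    """
    fragments = -(-total_bits // fragment_bits)    # ceiling division
    p_ok = (1.0 - p_bit_error) ** fragment_bits
    return fragments * fragment_bits / p_ok


whole = expected_bits_sent(12144, 12144, 1e-4)    # one unfragmented frame
split = expected_bits_sent(12144, 1518, 1e-4)     # eight fragments of 1518 bits
```

Under these assumptions the unfragmented frame costs roughly 41,000 transmitted bits on average, against roughly 14,000 for the fragmented one. 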


All of the above discussion applies to the 802.11 DCF mode. In this mode, there is no central 
control, and stations compete for air time, just as they do with Ethernet. The other allowed 
mode is PCF, in which the base station polls the other stations, asking them if they have any 
frames to send. Since transmission order is completely controlled by the base station in PCF 
mode, no collisions ever occur. The standard prescribes the mechanism for polling, but not the 
polling frequency, polling order, or even whether all stations need to get equal service. 


The basic mechanism is for the base station to broadcast a beacon frame periodically (10 to 
100 times per second). The beacon frame contains system parameters, such as hopping 
sequences and dwell times (for FHSS), clock synchronization, etc. It also invites new stations 
to sign up for polling service. Once a station has signed up for polling service at a certain rate, 
it is effectively guaranteed a certain fraction of the bandwidth, thus making it possible to give 
quality-of-service guarantees. 


Battery life is always an issue with mobile wireless devices, so 802.11 pays attention to the 
issue of power management. In particular, the base station can direct a mobile station to go 
into sleep state until explicitly awakened by the base station or the user. Having told a station 
to go to sleep, however, means that the base station has the responsibility for buffering any 
frames directed at it while the mobile station is asleep. These can be collected later. 


PCF and DCF can coexist within one cell. At first it might seem impossible to have central 
control and distributed control operating at the same time, but 802.11 provides a way to 
achieve this goal. It works by carefully defining the interframe time interval. After a frame has 
been sent, a certain amount of dead time is required before any station may send a frame. 
Four different intervals are defined, each for a specific purpose. The four intervals are depicted 
in Fig. 4-29. 


Figure 4-29. Interframe spacing in 802.11. 


The shortest interval is SIFS (Short InterFrame Spacing). It is used to allow the parties in a 
single dialog the chance to go first. This includes letting the receiver send a CTS to respond to 
an RTS, letting the receiver send an ACK for a fragment or full data frame, and letting the 
sender of a fragment burst transmit the next fragment without having to send an RTS again. 


There is always exactly one station that is entitled to respond after a SIFS interval. If it fails to 
make use of its chance and a time PIFS (PCF InterFrame Spacing) elapses, the base station 
may send a beacon frame or poll frame. This mechanism allows a station sending a data frame 
or fragment sequence to finish its frame without anyone else getting in the way, but gives the 
base station a chance to grab the channel when the previous sender is done without having to 
compete with eager users. 


If the base station has nothing to say and a time DIFS (DCF InterFrame Spacing) elapses, 
any station may attempt to acquire the channel to send a new frame. The usual contention 
rules apply, and binary exponential backoff may be needed if a collision occurs. 


The last time interval, EIFS (Extended InterFrame Spacing), is used only by a station that 
has just received a bad or unknown frame to report the bad frame. The idea of giving this 
event the lowest priority is that since the receiver may have no idea of what is going on, it 
should wait a substantial time to avoid interfering with an ongoing dialog between two 
stations. 
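

The four intervals amount to a priority scheme: whoever is entitled to the shortest wait speaks first. The sketch below captures this. The microsecond values are illustrative assumptions (only their ordering matters), and the function name is ours. 

```python
# Illustrative values in microseconds; only the ordering SIFS < PIFS < DIFS < EIFS matters.
IFS_US = {"SIFS": 10, "PIFS": 30, "DIFS": 50, "EIFS": 364}


def who_goes_next(waiting):
    """Given {station: interval name}, return the station whose interval expires first.

    Assumes every timer starts when the previous frame ends and that no two
    entries in `waiting` use the same interval.
    """
    return min(waiting, key=lambda station: IFS_US[waiting[station]])


contenders = {
    "receiver sending ACK": "SIFS",
    "base station sending beacon": "PIFS",
    "station with a new frame": "DIFS",
    "station reporting a bad frame": "EIFS",
}
first = who_goes_next(contenders)
```

If the receiver has nothing to send, removing it from the dictionary hands the channel to the base station, and so on down the list. 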


4.3.4 The 802.11 Frame Structure 


The 802.11 standard defines three different classes of frames on the wire: data, control, and 
management. Each of these has a header with a variety of fields used within the MAC 
sublayer. In addition, there are some headers used by the physical layer but these mostly deal 
with the modulation techniques used, so we will not discuss them here. 


The format of the data frame is shown in Fig. 4-30. First comes the Frame Control field. It 
itself has 11 subfields. The first of these is the Protocol version, which allows two versions of 
the protocol to operate at the same time in the same cell. Then come the Type (data, control, 
or management) and Subtype fields (e.g., RTS or CTS). The To DS and From DS bits indicate 
the frame is going to or coming from the intercell distribution system (e.g., Ethernet). The MF 
bit means that more fragments will follow. The Retry bit marks a retransmission of a frame 
sent earlier. The Power management bit is used by the base station to put the receiver into 
sleep state or take it out of sleep state. The More bit indicates that the sender has additional 
frames for the receiver. The W bit specifies that the frame body has been encrypted using the 
WEP (Wired Equivalent Privacy) algorithm. Finally, the O bit tells the receiver that a 
sequence of frames with this bit on must be processed strictly in order. 
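

The 11 subfields pack into 16 bits, which the following Python sketch takes apart and puts together. The field widths follow the usual layout (2, 2, and 4 bits, then eight single bits); numbering the bits from the least significant end is an assumption of this sketch, and the names are ours. 

```python
# (subfield name, width in bits), listed from the least significant bit upward.
# The bit ordering is an assumption of this sketch.
FRAME_CONTROL_FIELDS = [
    ("version", 2), ("type", 2), ("subtype", 4),
    ("to_ds", 1), ("from_ds", 1), ("more_fragments", 1), ("retry", 1),
    ("power_mgmt", 1), ("more", 1), ("w", 1), ("o", 1),
]


def parse_frame_control(value):
    """Split a 16-bit Frame Control value into its 11 subfields."""
    fields = {}
    shift = 0
    for name, width in FRAME_CONTROL_FIELDS:
        fields[name] = (value >> shift) & ((1 << width) - 1)
        shift += width
    return fields


def build_frame_control(**fields):
    """Inverse of parse_frame_control; subfields not given default to 0."""
    value, shift = 0, 0
    for name, width in FRAME_CONTROL_FIELDS:
        value |= (fields.get(name, 0) & ((1 << width) - 1)) << shift
        shift += width
    return value


# Example values only: a retransmitted frame with type 1 and subtype 11.
fc = build_frame_control(type=1, subtype=11, retry=1)
parsed = parse_frame_control(fc)
```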


Figure 4-30. The 802.11 data frame. 


The second field of the data frame, the Duration field, tells how long the frame and its 
acknowledgement will occupy the channel. This field is also present in the control frames and 
is how other stations manage the NAV mechanism. The frame header contains four addresses, 
all in standard IEEE 802 format. The source and destination are obviously needed, but what 
are the other two for? Remember that frames may enter or leave a cell via a base station. The 
other two addresses are used for the source and destination base stations for intercell traffic. 


The Sequence field allows fragments to be numbered. Of the 16 bits available, 12 identify the 
frame and 4 identify the fragment. The Data field contains the payload, up to 2312 bytes, 
followed by the usual Checksum. 
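

The split of the Sequence field into 12 and 4 bits can be expressed as a pair of helper functions. Placing the fragment number in the low-order bits is an assumption of this sketch, as are the function names. 

```python
def pack_sequence(frame_number, fragment_number):
    """Combine a 12-bit frame number and a 4-bit fragment number into 16 bits.

    Putting the fragment number in the low-order 4 bits is an assumption.
    """
    if not (0 <= frame_number < 4096 and 0 <= fragment_number < 16):
        raise ValueError("frame number needs 12 bits, fragment number 4 bits")
    return (frame_number << 4) | fragment_number


def unpack_sequence(value):
    """Return (frame_number, fragment_number) from a 16-bit Sequence field."""
    return value >> 4, value & 0xF


field = pack_sequence(1000, 3)    # fragment 3 of frame 1000
```

With 4 bits for the fragment number, a frame can be split into at most 16 fragments. 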


Management frames have a format similar to that of data frames, except without one of the 
base station addresses, because management frames are restricted to a single cell. Control 
frames are shorter still, having only one or two addresses, no Data field, and no Sequence 
field. The key information here is in the Subtype field, usually RTS, CTS, or ACK. 


4.3.5 Services 


The 802.11 standard states that each conformant wireless LAN must provide nine services. 
These services are divided into two categories: five distribution services and four station 
services. The distribution services relate to managing cell membership and interacting with 
stations outside the cell. In contrast, the station services relate to activity within a single cell. 


The five distribution services are provided by the base stations and deal with station mobility 
as they enter and leave cells, attaching themselves to and detaching themselves from base 
stations. They are as follows. 


1. Association. This service is used by mobile stations to connect themselves to base 
stations. Typically, it is used just after a station moves within the radio range of the 
base station. Upon arrival, it announces its identity and capabilities. The capabilities 
include the data rates supported, need for PCF services (i.e., polling), and power 
management requirements. The base station may accept or reject the mobile station. If 
the mobile station is accepted, it must then authenticate itself. 

2. Disassociation. Either the station or the base station may disassociate, thus breaking 
the relationship. A station should use this service before shutting down or leaving, but 
the base station may also use it before going down for maintenance. 

3. Reassociation. A station may change its preferred base station using this service. This 
facility is useful for mobile stations moving from one cell to another. If it is used 
correctly, no data will be lost as a consequence of the handover. (But 802.11, like 
Ethernet, is just a best-efforts service.) 

4. Distribution. This service determines how to route frames sent to the base station. If 
the destination is local to the base station, the frames can be sent out directly over the 
air. Otherwise, they will have to be forwarded over the wired network. 

5. Integration. If a frame needs to be sent through a non-802.11 network with a 
different addressing scheme or frame format, this service handles the translation from 
the 802.11 format to the format required by the destination network. 


The remaining four services are intracell (i.e., relate to actions within a single cell). They are 
used after association has taken place and are as follows. 


1. Authentication. Because wireless communication can easily be sent or received by 
unauthorized stations, a station must authenticate itself before it is permitted to send 
data. After a mobile station has been associated by the base station (i.e., accepted into 
its cell), the base station sends a special challenge frame to it to see if the mobile 
station knows the secret key (password) that has been assigned to it. It proves its 
knowledge of the secret key by encrypting the challenge frame and sending it back to 
the base station. If the result is correct, the mobile is fully enrolled in the cell. In the 
initial standard, the base station does not have to prove its identity to the mobile 
station, but work to repair this defect in the standard is underway. 

2. Deauthentication. When a previously authenticated station wants to leave the 
network, it is deauthenticated. After deauthentication, it may no longer use the 
network. 

3. Privacy. For information sent over a wireless LAN to be kept confidential, it must be 
encrypted. This service manages the encryption and decryption. The encryption 
algorithm specified is RC4, invented by Ronald Rivest of M.I.T. 

4. Data delivery. Finally, data transmission is what it is all about, so 802.11 naturally 
provides a way to transmit and receive data. Since 802.11 is modeled on Ethernet and 
transmission over Ethernet is not guaranteed to be 100% reliable, transmission over 
802.11 is not guaranteed to be reliable either. Higher layers must deal with detecting 
and correcting errors. 
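

The challenge-response exchange used for authentication, together with the RC4 cipher named under the privacy service, can be sketched as follows. This is a simplified illustration, not the procedure in the standard: the function names and the 128-byte challenge length are ours, and the initialization vector and integrity check that accompany RC4 in WEP are left out. 

```python
import os


def rc4(key, data):
    """Encrypt or decrypt `data` with RC4 (the two operations are identical)."""
    # Key-scheduling algorithm: permute 0..255 under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudorandom generation algorithm: XOR the keystream with the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def station_response(secret_key, challenge):
    """Mobile station: prove knowledge of the key by encrypting the challenge."""
    return rc4(secret_key, challenge)


def base_station_verify(secret_key, challenge, response):
    """Base station: accept only if the response matches its own encryption."""
    return rc4(secret_key, challenge) == response


challenge = os.urandom(128)    # hypothetical challenge length
accepted = base_station_verify(b"secret", challenge, station_response(b"secret", challenge))
```

A station that does not hold the key cannot produce the matching response, so it is turned away. Note that nothing in this exchange lets the mobile station check the identity of the base station, which is the defect mentioned above. 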


An 802.11 cell has some parameters that can be inspected and, in some cases, adjusted. They 
relate to encryption, timeout intervals, data rates, beacon frequency, and so on. 


Wireless LANs based on 802.11 are starting to be deployed in office buildings, airports, hotels, 
restaurants, and campuses around the world. Rapid growth is expected. For an account of 
the widespread deployment of 802.11 at CMU, see (Hills, 2001). 


4.4 Broadband Wireless 


We have been indoors too long. Let us now go outside and see if any interesting networking is 
going on there. It turns out that quite a bit is going on there, and some of it has to do with the 
so-called last mile. With the deregulation of the telephone system in many countries, 
competitors to the entrenched telephone company are now often allowed to offer local voice 
and high-speed Internet service. There is certainly plenty of demand. The problem is that 
running fiber, coax, or even category 5 twisted pair to millions of homes and businesses is 
prohibitively expensive. What is a competitor to do? 


The answer is broadband wireless. Erecting a big antenna on a hill just outside of town and 
installing antennas directed at it on customers' roofs is much easier and cheaper than digging 
trenches and stringing cables. Thus, competing telecommunication companies have a great 
interest in providing a multimegabit wireless communication service for voice, Internet, movies 
on demand, etc. As we saw in Fig. 2-30, LMDS was invented for this purpose. However, until 
recently, every carrier devised its own system. This lack of standards meant that hardware and 
software could not be mass produced, which kept prices high and acceptance low. 


Many people in the industry realized that having a broadband wireless standard was the key 
element missing, so IEEE was asked to form a committee composed of people from key 
companies and academia to draw up the standard. The next number available in the 802 
numbering space was 802.16, so the standard got this number. Work was started in July 
1999, and the final standard was approved in April 2002. Officially the standard is called "Air 
Interface for Fixed Broadband Wireless Access Systems." However, some people prefer to call 


it a wireless MAN (Metropolitan Area Network) or a wireless local loop. We regard all 
these terms as interchangeable. 


Like some of the other 802 standards, 802.16 was heavily influenced by the OSI model, 
including the (sub)layers, terminology, service primitives, and more. Unfortunately, also like 
OSI, it is fairly complicated. In the following sections we will give a brief description of some of 
the highlights of 802.16, but this treatment is far from complete and leaves out many details. 
For additional information about broadband wireless in general, see (Bolcskei et al., 2001; and 
Webb, 2001). For information about 802.16 in particular, see (Eklund et al., 2002). 


4.4.1 Comparison of 802.11 with 802.16 


At this point you may be thinking: Why devise a new standard? Why not just use 802.11? 
There are some very good reasons for not using 802.11, primarily because 802.11 and 802.16 
solve different problems. Before getting into the technology of 802.16, it is probably 
worthwhile saying a few words about why a new standard is needed at all. 


The environments in which 802.11 and 802.16 operate are similar in some ways, primarily in 
that they were designed to provide high-bandwidth wireless communications. But they also 
differ in some major ways. To start with, 802.16 provides service to buildings, and buildings 
are not mobile. They do not migrate from cell to cell often. Much of 802.11 deals with mobility, 
and none of that is relevant here. Next, buildings can have more than one computer in them, a 
complication that does not occur when the end station is a single notebook computer. Because 
building owners are generally willing to spend much more money for communication gear than 
are notebook owners, better radios are available. This difference means that 802.16 can use 
full-duplex communication, something 802.11 avoids to keep the cost of the radios low. 


Because 802.16 runs over part of a city, the distances involved can be several kilometers, 
which means that the perceived power at the base station can vary widely from station to 
station. This variation affects the signal-to-noise ratio, which, in turn, dictates multiple 
modulation schemes. Also, open communication over a city means that security and privacy 
are essential and mandatory. 


Furthermore, each cell is likely to have many more users than will a typical 802.11 cell, and 
these users are expected to use more bandwidth than will a typical 802.11 user. After all, it is 
rare for a company to invite 50 employees to show up in a room with their laptops to see if 
they can saturate the 802.11 wireless network by watching 50 separate movies at once. For 
this reason, more spectrum is needed than the ISM bands can provide, forcing 802.16 to 
operate in the much higher 10-to-66 GHz frequency range, the only place unused spectrum is 
still available. 


But these millimeter waves have different physical properties than the longer waves in the ISM 
bands, which in turn requires a completely different physical layer. One property that 
millimeter waves have is that they are strongly absorbed by water (especially rain, but to 
some extent also by snow, hail, and with a bit of bad luck, heavy fog). Consequently, error 
handling is more important than in an indoor environment. Millimeter waves can be focused 
into directional beams (802.11 is omnidirectional), so choices made in 802.11 relating to 
multipath propagation are moot here. 


Another issue is quality of service. While 802.11 provides some support for real-time traffic 
(using PCF mode), it was not really designed for telephony and heavy-duty multimedia usage. 
In contrast, 802.16 is expected to support these applications completely because it is intended 
for residential as well as business use. 


In short, 802.11 was designed to be mobile Ethernet, whereas 802.16 was designed to be 
wireless, but stationary, cable television. These differences are so big that the resulting 
standards are very different as they try to optimize different things. 


A very brief comparison with the cellular phone system is also worthwhile. With mobile phones, 
we are talking about narrow-band, voice-oriented, low-powered, mobile stations that 
communicate using medium-length microwaves. Nobody watches high-resolution, two-hour 
movies on GSM mobile phones (yet). Even UMTS has little hope of changing this situation. In 
short, the wireless MAN world is far more demanding than is the mobile phone world, so a 
completely different system is needed. Whether 802.16 could be used for mobile devices in the 
future is an interesting question. It was not optimized for them, but the possibility is there. For 
the moment it is focused on fixed wireless. 


4.4.2 The 802.16 Protocol Stack 


The 802.16 protocol stack is illustrated in Fig. 4-31. The general structure is similar to that of 
the other 802 networks, but with more sublayers. The bottom sublayer deals with 
transmission. Traditional narrow-band radio is used with conventional modulation schemes. 
Above the physical transmission layer comes a convergence sublayer to hide the different 
technologies from the data link layer. Actually, 802.11 has something like this too, only the 
committee chose not to formalize it with an OSI-type name. 


Figure 4-31. The 802.16 protocol stack. 


Although we have not shown them in the figure, work is already underway to add two new 
physical layer protocols. The 802.16a standard will support OFDM in the 2-to-11 GHz 
frequency range. The 802.16b standard will operate in the 5-GHz ISM band. Both of these are 
attempts to move closer to 802.11. 


The data link layer consists of three sublayers. The bottom one deals with privacy and security, 
which is far more crucial for public outdoor networks than for private indoor networks. It 
manages encryption, decryption, and key management. 


Next comes the MAC sublayer common part. This is where the main protocols, such as channel 
management, are located. The model is that the base station controls the system. It can 
schedule the downstream (i.e., base to subscriber) channels very efficiently and plays a major 
role in managing the upstream (i.e., subscriber to base) channels as well. An unusual feature 
of the MAC sublayer is that, unlike those of the other 802 networks, it is completely connection 
oriented, in order to provide quality-of-service guarantees for telephony and multimedia 
communication. 


The service-specific convergence sublayer takes the place of the logical link sublayer in the 
other 802 protocols. Its function is to interface to the network layer. A complication here is 
that 802.16 was designed to integrate seamlessly with both datagram protocols (e.g., PPP, IP, 
and Ethernet) and ATM. The problem is that packet protocols are connectionless and ATM is 
connection oriented. This means that every ATM connection has to map onto an 802.16 
connection, in principle a straightforward matter. But onto which 802.16 connection should an 
incoming IP packet be mapped? That problem is dealt with in this sublayer. 


4.4.3 The 802.16 Physical Layer 


As mentioned above, broadband wireless needs a lot of spectrum, and the only place to find it 
is in the 10-to-66 GHz range. These millimeter waves have an interesting property that longer 
microwaves do not: they travel in straight lines, unlike sound but similar to light. As a 
consequence, the base station can have multiple antennas, each pointing at a different sector 
of the surrounding terrain, as shown in Fig. 4-32. Each sector has its own users and is fairly 
independent of the adjoining ones, something not true of cellular radio, which is 
omnidirectional. 


Figure 4-32. The 802.16 transmission environment. 


Because signal strength in the millimeter band falls off sharply with distance from the base 
station, the signal-to-noise ratio also drops with distance from the base station. For this 
reason, 802.16 employs three different modulation schemes, depending on how far the 
subscriber station is from the base station. For close-in subscribers, QAM-64 is used, with 6 
bits/baud. For medium-distance subscribers, QAM-16 is used, with 4 bits/baud. For distant 
subscribers, QPSK is used, with 2 bits/baud. For example, for a typical value of 25 MHz worth 
of spectrum, QAM-64 gives 150 Mbps, QAM-16 gives 100 Mbps, and QPSK gives 50 Mbps. In 
other words, the farther the subscriber is from the base station, the lower the data rate 
(similar to what we saw with ADSL in Fig. 2-27). The constellation diagrams for these three 
modulation techniques were shown in Fig. 2-25. 
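

The arithmetic behind these figures can be sketched in a few lines. The sketch below assumes, as the example in the text implies, that 25 MHz of spectrum carries 25 Mbaud (one symbol per second per hertz). The function name data_rate_mbps and the table BITS_PER_BAUD are ours, chosen only for illustration.

```python
# Bits carried per symbol (baud) for each modulation scheme, as given in the text.
BITS_PER_BAUD = {"QAM-64": 6, "QAM-16": 4, "QPSK": 2}


def data_rate_mbps(spectrum_mhz: float, scheme: str) -> float:
    """Raw data rate in Mbps for a given amount of spectrum.

    Assumption (ours): one symbol per second per hertz, so 25 MHz -> 25 Mbaud.
    """
    symbol_rate_mbaud = spectrum_mhz * 1.0
    return symbol_rate_mbaud * BITS_PER_BAUD[scheme]


if __name__ == "__main__":
    for scheme in BITS_PER_BAUD:
        print(scheme, data_rate_mbps(25, scheme), "Mbps")
```

With 25 MHz the three schemes give 150, 100, and 50 Mbps, matching the values quoted above.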


Given the goal of producing a broadband system, and subject to the above physical 
constraints, the 802.16 designers worked hard to use the available spectrum efficiently. One 
thing they did not like was the way GSM and DAMPS work. Both of those use different but 
equal frequency bands for upstream and downstream traffic. For voice, traffic is probably 
symmetric for the most part, but for Internet access, there is often more downstream traffic 
than upstream traffic. Consequently, 802.16 provides a more flexible way to allocate the 
bandwidth. Two schemes are used, FDD (Frequency Division Duplexing) and TDD (Time 
Division Duplexing). The latter is illustrated in Fig. 4-33. Here the base station periodically 
sends out frames. Each frame contains time slots. The first ones are for downstream traffic. 
Then comes a guard time used by the stations to switch direction. Finally, we have slots for 
upstream traffic. The number of time slots devoted to each direction can be changed 
dynamically to match the bandwidth in each direction to the traffic. 
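

To make the idea of a movable boundary concrete, here is a minimal sketch of how a frame's time slots might be divided. The proportional policy, the function name split_slots, and the rule that each direction keeps at least one slot are our assumptions for illustration; the text says only that the division can be changed dynamically.

```python
def split_slots(total_slots: int, guard_slots: int,
                down_demand: float, up_demand: float) -> tuple:
    """Divide a TDD frame into (downstream, upstream) slot counts.

    Hypothetical policy: slots left after the guard time are shared in
    proportion to the offered traffic, and each direction gets at least one.
    """
    usable = total_slots - guard_slots
    if usable < 2:
        raise ValueError("need at least one slot in each direction")
    total_demand = down_demand + up_demand
    if total_demand <= 0:
        down = usable // 2          # no traffic information: split evenly
    else:
        down = round(usable * down_demand / total_demand)
    down = max(1, min(usable - 1, down))
    return down, usable - down


if __name__ == "__main__":
    # Three times as much downstream traffic as upstream traffic.
    print(split_slots(34, 2, 3.0, 1.0))
```

A base station following such a policy would recompute the split from frame to frame as the traffic mix changes, something the fixed, equal bands of GSM and DAMPS cannot do.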


Figure 4-33. Frames and time slots for time division duplexing. 


Downstream traffic is mapped onto time slots by the base station. The base station is 
completely in control for this direction. Upstream traffic is more complex and depends on the 
quality of service required. We will come to slot allocation when we discuss the MAC sublayer 
below. 


Another interesting feature of the physical layer is its ability to pack multiple MAC frames 
back-to-back in a single physical transmission. The feature enhances spectral efficiency by 
reducing the number of preambles and physical layer headers needed. 


Also noteworthy is the use of Hamming codes to do forward error correction in the physical 
layer. Nearly all other networks simply rely on checksums to detect errors and request 
retransmission when frames are received in error. But in the wide area broadband 
environment, so many transmission errors are expected that error correction is employed in 
the physical layer, in addition to checksums in the higher layers. The net effect of the error 
correction is to make the channel look better than it really is (in the same way that CD-ROMs 
appear to be very reliable, but only because more than half the total bits are devoted to error 
correction in the physical layer). 
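

To show what forward error correction buys, the sketch below implements the small Hamming (7,4) code, which repairs any single-bit error in a 7-bit codeword without retransmission. It is only an illustration of the principle: the text does not give the code parameters that 802.16 actually uses, and the function names are ours.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    sent = hamming74_encode([1, 0, 1, 1])
    received = list(sent)
    received[4] ^= 1                   # one bit damaged in transit
    print(hamming74_decode(received))  # the original data bits are recovered
```

Note the price: three of every seven bits are overhead. That is the trade-off mentioned above, in which the channel looks better than it really is at the cost of capacity.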


4.4.4 The 802.16 MAC Sublayer Protocol 


The data link layer is divided into three sublayers, as we saw in Fig. 4-31. Since we will not 
study cryptography until Chap. 8, it is difficult to explain now how the security sublayer works. 
Suffice it to say that encryption is used to keep secret all data transmitted. Only the frame 
payloads are encrypted; the headers are not. This property means that a snooper can see who 
is talking to whom but cannot tell what they are saying to each other. 


If you already know something about cryptography, here comes a one-paragraph explanation 
of the security sublayer. If you know nothing about cryptography, you are not likely to find the 
next paragraph terribly enlightening (but you might consider rereading it after finishing Chap. 
8). 


At the time a subscriber connects to a base station, they perform mutual authentication with 
RSA public-key cryptography using X.509 certificates. The payloads themselves are encrypted 
using a symmetric-key system, either DES with cipher block chaining or triple DES with two 
keys. AES (Rijndael) is likely to be added soon. Integrity checking uses SHA-1. Now that was 
not so bad, was it? 


Let us now look at the MAC sublayer common part. MAC frames occupy an integral number of 
physical layer time slots. Each frame is composed of sub-frames, the first two of which are the 
downstream and upstream maps. These maps tell what is in which time slot and which time 
slots are free. The downstream map also contains various system parameters to inform new 
stations as they come on-line. 


The downstream channel is fairly straightforward. The base station simply decides what to put 
in which subframe. The upstream channel is more complicated since there are competing 
uncoordinated subscribers that need access to it. Its allocation is tied closely to the quality-of- 
service issue. Four classes of service are defined as follows: 


1. Constant bit rate service. 
2. Real-time variable bit rate service. 
3. Non-real-time variable bit rate service. 
4. Best-efforts service. 


All service in 802.16 is connection-oriented, and each connection gets one of the above classes 
of service, determined when the connection is set up. This design is very different from that of 
802.11 or Ethernet, which have no connections in the MAC sublayer. 


Constant bit rate service is intended for transmitting uncompressed voice such as on a T1 
channel. This service needs to send a predetermined amount of data at predetermined time 
intervals. It is accommodated by dedicating certain time slots to each connection of this type. 
Once the bandwidth has been allocated, the time slots are available automatically, without the 
need to ask for each one. 


Real-time variable bit rate service is for compressed multimedia and other soft real-time 
applications in which the amount of bandwidth needed each instant may vary. It is 
accommodated by the base station polling the subscriber at a fixed interval to ask how much 
bandwidth is needed this time. 


Non-real-time variable bit rate service is for heavy transmissions that are not real time, such 
as large file transfers. For this service the base station polls the subscriber often, but not at 
rigidly-prescribed time intervals. A constant bit rate customer can set a bit in one of its frames 
requesting a poll in order to send additional (variable bit rate) traffic. 


If a station does not respond to a poll k times in a row, the base station puts it into a multicast 
group and takes away its personal poll. Instead, when the multicast group is polled, any of the 
stations in it can respond, contending for service. In this way, stations with little traffic do not 
waste valuable polls. 


Finally, best-efforts service is for everything else. No polling is done and the subscriber must 
contend for bandwidth with other best-efforts subscribers. Requests for bandwidth are done in 
time slots marked in the upstream map as available for contention. If a request is successful, 
its success will be noted in the next downstream map. If it is not successful, unsuccessful 
subscribers have to try again later. To minimize collisions, the Ethernet binary exponential 
backoff algorithm is used. 
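

As a reminder of how binary exponential backoff works, here is a minimal sketch. After each failed request the subscriber doubles the range from which it draws a random waiting time. The cap of 10 doublings is borrowed from Ethernet and is an assumption here, as are the function and parameter names.

```python
import random


def backoff_slots(failed_attempts: int, max_exponent: int = 10, rng=random) -> int:
    """Contention opportunities to skip before the next bandwidth request.

    After i failures the wait is drawn uniformly from 0 .. 2**i - 1, with
    i capped at max_exponent (the Ethernet value, assumed here).
    """
    window = 2 ** min(failed_attempts, max_exponent)
    return rng.randrange(window)


if __name__ == "__main__":
    rng = random.Random(42)            # fixed seed so runs are repeatable
    for failures in range(1, 6):
        wait = backoff_slots(failures, rng=rng)
        print(f"after {failures} failure(s): wait {wait} of {2 ** failures} slots")
```

Because the window grows with each failure, heavily contended request slots thin out quickly, while a lone subscriber rarely waits at all.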


The standard defines two forms of bandwidth allocation: per station and per connection. In the 
former case, the subscriber station aggregates the needs of all the users in the building and 
makes collective requests for them. When it is granted bandwidth, it doles out that bandwidth 
to its users as it sees fit. In the latter case, the base station manages each connection directly. 


4.4.5 The 802.16 Frame Structure 


All MAC frames begin with a generic header. The header is followed by an optional payload and 
an optional checksum (CRC), as illustrated in Fig. 4-34. The payload is not needed in control 
frames, for example, those requesting channel slots. The checksum is (surprisingly) also 
optional due to the error correction in the physical layer and the fact that no attempt is ever 
made to retransmit real-time frames. If no retransmissions will be attempted, why even bother 
with a checksum? 


Figure 4-34. (a) A generic frame. (b) A bandwidth request frame. 


A quick rundown of the header fields of Fig. 4-34(a) is as follows. The EC bit tells whether the 
payload is encrypted. The Type field identifies the frame type, mostly telling whether packing 
and fragmentation are present. The CI field indicates the presence or absence of the final 
checksum. The EK field tells which of the encryption keys is being used (if any). The Length 
field gives the complete length of the frame, including the header. The Connection identifier 
tells which connection this frame belongs to. Finally, the HeaderCRC field is a checksum over 
the header only, using the polynomial x^8 + x^2 + x + 1. 
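

An 8-bit checksum over the polynomial x^8 + x^2 + x + 1 can be computed with the usual shift-and-XOR loop, as sketched below. Only the polynomial comes from the header description. The zero initial value, the most-significant-bit-first order, and the sample header bytes are assumptions made for the example.

```python
def header_crc8(data: bytes) -> int:
    """CRC-8 of data using the generator x^8 + x^2 + x + 1.

    The low eight bits of the generator are 0x07. The initial value 0 and
    MSB-first bit order are assumptions of this sketch.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:                      # top bit set: subtract the generator
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


if __name__ == "__main__":
    header = bytes([0x00, 0x00, 0x2A, 0x12, 0x34])   # made-up header bytes
    check = header_crc8(header)
    # A receiver running the same computation over header plus checksum gets 0.
    print(hex(check), header_crc8(header + bytes([check])))
```

The property used in the last line, that a correct header followed by its checksum leaves a zero remainder, is what lets a receiver validate the header with a single pass.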


A second header type, for frames that request bandwidth, is shown in Fig. 4-34(b). It starts 
with a 1 bit instead of a 0 bit and is similar to the generic header except that the second and 
third bytes form a 16-bit number telling how much bandwidth is needed to carry the specified 
number of bytes. Bandwidth request frames do not carry a payload or full-frame CRC. 


A great deal more could be said about 802.16, but this is not the place to say it. For more 
information, please consult the standard itself. 


4.5 Data Link Layer Switching 


Many organizations have multiple LANs and wish to connect them. LANs can be connected by 
devices called bridges, which operate in the data link layer. Bridges examine the data link 
layer addresses to do routing. Since they are not supposed to examine the payload field of the 
frames they route, they can transport IPv4 (used in the Internet now), IPv6 (will be used in 
the Internet in the future), AppleTalk, ATM, OSI, or any other kinds of packets. In contrast, 
routers examine the addresses in packets and route based on them. Although this seems like a 
clear division between bridges and routers, some modern developments, such as the advent of 
switched Ethernet, have muddied the waters, as we will see later. In the following sections we 
will look at bridges and switches, especially for connecting different 802 LANs. For a 
comprehensive treatment of bridges, switches, and related topics, see (Perlman, 2000). 


Before getting into the technology of bridges, it is worthwhile taking a look at some common 
situations in which bridges are used. We will mention six reasons why a single organization 
may end up with multiple LANs. 


First, many university and corporate departments have their own LANs, primarily to connect 
their own personal computers, workstations, and servers. Since the goals of the various 
departments differ, different departments choose different LANs, without regard to what other 
departments are doing. Sooner or later, there is a need for interaction, so bridges are needed. 
In this example, multiple LANs came into existence due to the autonomy of their owners. 


Second, the organization may be geographically spread over several buildings separated by 
considerable distances. It may be cheaper to have separate LANs in each building and connect 
them with bridges and laser links than to run a single cable over the entire site. 


Third, it may be necessary to split what is logically a single LAN into separate LANs to 
accommodate the load. At many universities, for example, thousands of workstations are 
available for student and faculty computing. Files are normally kept on file server machines 
and are downloaded to users' machines upon request. The enormous scale of this system 
precludes putting all the workstations on a single LAN—the total bandwidth needed is far too 
high. Instead, multiple LANs connected by bridges are used, as shown in Fig. 4-39. Each LAN 
contains a cluster of workstations with its own file server so that most traffic is restricted to a 
single LAN and does not add load to the backbone. 


Figure 4-39. Multiple LANs connected by a backbone to handle a total 
load higher than the capacity of a single LAN. 


It is worth noting that although we usually draw LANs as multidrop cables as in Fig. 4-39 (the 
classic look), they are more often implemented with hubs or especially switches nowadays. 
However, a long multidrop cable with multiple machines plugged into it and a hub with the 
machines connected inside the hub are functionally identical. In both cases, all the machines 
belong to the same collision domain, and all use the CSMA/CD protocol to send frames. 
Switched LANs are different, however, as we saw before and will see again shortly. 


Fourth, in some situations, a single LAN would be adequate in terms of the load, but the 
physical distance between the most distant machines is too great (e.g., more than 2.5 km for 
Ethernet). Even if laying the cable is easy to do, the network would not work due to the 
excessively long round-trip delay. The only solution is to partition the LAN and install bridges 
between the segments. Using bridges, the total physical distance covered can be increased. 


Fifth, there is the matter of reliability. On a single LAN, a defective node that keeps outputting 
a continuous stream of garbage can cripple the LAN. Bridges can be inserted at critical places, 
like fire doors in a building, to prevent a single node that has gone berserk from bringing down 
the entire system. Unlike a repeater, which just copies whatever it sees, a bridge can be 
programmed to exercise some discretion about what it forwards and what it does not forward. 


Sixth, and last, bridges can contribute to the organization's security. Most LAN interfaces have 
a promiscuous mode, in which all frames are given to the computer, not just those 
addressed to it. Spies and busybodies love this feature. By inserting bridges at various places 
and being careful not to forward sensitive traffic, a system administrator can isolate parts of 
the network so that its traffic cannot escape and fall into the wrong hands. 


Ideally, bridges should be fully transparent, meaning it should be possible to move a machine 
from one cable segment to another without changing any hardware, software, or configuration 
tables. Also, it should be possible for machines on any segment to communicate with machines 
on any other segment without regard to the types of LANs being used on the two segments or 
on segments in between them. This goal is sometimes achieved, but not always. 


4.5.1 Bridges from 802.x to 802.y 


Having seen why bridges are needed, let us now turn to the question of how they work. Figure 
4-40 illustrates the operation of a simple two-port bridge. Host A on a wireless (802.11) LAN 
has a packet to send to a fixed host, B, on an (802.3) Ethernet to which the wireless LAN is 
connected. The packet descends into the LLC sublayer and acquires an LLC header (shown in 
black in the figure). Then it passes into the MAC sublayer and an 802.11 header is prepended 
to it (also a trailer, not shown in the figure). This unit goes out over the air and is picked up by 
the base station, which sees that it needs to go to the fixed Ethernet. When it hits the bridge 
connecting the 802.11 network to the 802.3 network, it starts in the physical layer and works 
its way upward. In the MAC sublayer in the bridge, the 802.11 header is stripped off. The bare 
packet (with LLC header) is then handed off to the LLC sublayer in the bridge. In this example, 
the packet is destined for an 802.3 LAN, so it works its way down the 802.3 side of the bridge 
and off it goes on the Ethernet. Note that a bridge connecting k different LANs will have k 
different MAC sublayers and k different physical layers, one for each type. 


Figure 4-40. Operation of a LAN bridge from 802.11 to 802.3. 


So far it looks like moving a frame from one LAN to another is easy. Such is not the case. In 
this section we will point out some of the difficulties that one encounters when trying to build a 
bridge between the various 802 LANs (and MANs). We will focus on 802.3, 802.11, and 
802.16, but there are others as well, each with its unique problems. 


To start with, each of the LANs uses a different frame format (see Fig. 4-41). Unlike the 
differences between Ethernet, token bus, and token ring, which were due to history and big 
corporate egos, here the differences are to some extent legitimate. For example, the Duration 
field in 802.11 is there due to the MACAW protocol and makes no sense in Ethernet. As a 
result, any copying between different LANs requires reformatting, which takes CPU time, 
requires a new checksum calculation, and introduces the possibility of undetected errors due to 
bad bits in the bridge's memory. 


Figure 4-41. The IEEE 802 frame formats. The drawing is not to scale. 


A second problem is that interconnected LANs do not necessarily run at the same data rate. 
When forwarding a long run of back-to-back frames from a fast LAN to a slower one, the 
bridge will not be able to get rid of the frames as fast as they come in. For example, if a 
gigabit Ethernet is pouring bits into an 11-Mbps 802.11b LAN at top speed, the bridge will 
have to buffer them, hoping not to run out of memory. Bridges that connect three or more 
LANs have a similar problem when several LANs are trying to feed the same output LAN at the 
same time even if all the LANs run at the same speed. 


A third problem, and potentially the most serious of all, is that different 802 LANs have 
different maximum frame lengths. An obvious problem arises when a long frame must be 
forwarded onto a LAN that cannot accept it. Splitting the frame into pieces is out of the 
question in this layer. All the protocols assume that frames either arrive or they do not. There 
is no provision for reassembling frames out of smaller units. This is not to say that such 
protocols could not be devised. They could be and have been. It is just that no data link 
protocols provide this feature, so bridges must keep their hands off the frame payload. 
Basically, there is no solution. Frames that are too large to be forwarded must be discarded. 
So much for transparency. 


Another point is security. Both 802.11 and 802.16 support encryption in the data link layer. 
Ethernet does not. This means that the various encryption services available to the wireless 
networks are lost when traffic passes over an Ethernet. Worse yet, if a wireless station uses 
data link layer encryption, there will be no way to decrypt it when it arrives over an Ethernet. 
If the wireless station does not use encryption, its traffic will be exposed over the air link. 
Either way there is a problem. 


One solution to the security problem is to do encryption in a higher layer, but then the 802.11 
station has to know whether it is talking to another station on an 802.11 network (meaning 
use data link layer encryption) or not (meaning do not use it). Forcing the station to make a 
choice destroys transparency. 


A final point is quality of service. Both 802.11 and 802.16 provide it in various forms, the 
former using PCF mode and the latter using constant bit rate connections. Ethernet has no 
concept of quality of service, so traffic from either of the others will lose its quality of service 
when passing over an Ethernet. 


4.5.2 Local Internetworking 


The previous section dealt with the problems encountered in connecting two different IEEE 802 
LANs via a single bridge. However, in large organizations with many LANs, just interconnecting 
them all raises a variety of issues, even if they are all just Ethernet. Ideally, it should be 
possible to go out and buy bridges designed to the IEEE standard, plug the connectors into the 
bridges, and everything should work perfectly, instantly. There should be no hardware changes 
required, no software changes required, no setting of address switches, no downloading of 
routing tables or parameters, nothing. Just plug in the cables and walk away. Furthermore, the 
operation of the existing LANs should not be affected by the bridges at all. In other words, the 
bridges should be completely transparent (invisible to all the hardware and software). 
Surprisingly enough, this is actually possible. Let us now take a look at how this magic is 
accomplished. 


In its simplest form, a transparent bridge operates in promiscuous mode, accepting every 
frame transmitted on all the LANs to which it is attached. As an example, consider the 
configuration of Fig. 4-42. Bridge B1 is connected to LANs 1 and 2, and bridge B2 is connected 
to LANs 2, 3, and 4. A frame arriving at bridge B1 on LAN 1 destined for A can be discarded 
immediately, because it is already on the correct LAN, but a frame arriving on LAN 1 for C or F 
must be forwarded. 


Figure 4-42. A configuration with four LANs and two bridges. 


When a frame arrives, a bridge must decide whether to discard or forward it, and if the latter, 
on which LAN to put the frame. This decision is made by looking up the destination address in 
a big (hash) table inside the bridge. The table can list each possible destination and tell which 
output line (LAN) it belongs on. For example, B2's table would list A as belonging to LAN 2, 
since all B2 has to know is which LAN to put frames for A on. That, in fact, more forwarding 
happens later is not of interest to it. 


When the bridges are first plugged in, all the hash tables are empty. None of the bridges know 
where any of the destinations are, so they use a flooding algorithm: every incoming frame for 
an unknown destination is output on all the LANs to which the bridge is connected except the 
one it arrived on. As time goes on, the bridges learn where destinations are, as described 
below. Once a destination is known, frames destined for it are put on only the proper LAN and 
are not flooded. 


The algorithm used by the transparent bridges is backward learning. As mentioned above, 
the bridges operate in promiscuous mode, so they see every frame sent on any of their LANs. 
By looking at the source address, they can tell which machine is accessible on which LAN. For 
example, if bridge B1 in Fig. 4-42 sees a frame on LAN 2 coming from C, it knows that C must 
be reachable via LAN 2, so it makes an entry in its hash table noting that frames going to C 
should use LAN 2. Any subsequent frame addressed to C coming in on LAN 1 will be forwarded, 
but a frame for C coming in on LAN 2 will be discarded. 


The topology can change as machines and bridges are powered up and down and moved 
around. To handle dynamic topologies, whenever a hash table entry is made, the arrival time 
of the frame is noted in the entry. Whenever a frame whose source is already in the table 
arrives, its entry is updated with the current time. Thus, the time associated with every entry 
tells the last time a frame from that machine was seen. 


Periodically, a process in the bridge scans the hash table and purges all entries more than a 
few minutes old. In this way, if a computer is unplugged from its LAN, moved around the 
building, and plugged in again somewhere else, within a few minutes it will be back in normal 
operation, without any manual intervention. This algorithm also means that if a machine is 
quiet for a few minutes, any traffic sent to it will have to be flooded until it next sends a frame 
itself. 


The routing procedure for an incoming frame depends on the LAN it arrives on (the source 
LAN) and the LAN its destination is on (the destination LAN), as follows: 


1. If destination and source LANs are the same, discard the frame. 
2. If the destination and source LANs are different, forward the frame. 
3. If the destination LAN is unknown, use flooding. 


As each frame arrives, this algorithm must be applied. Special-purpose VLSI chips do the 
lookup and update the table entry, all in a few microseconds. 
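

The whole procedure (backward learning, the three forwarding rules, and the aging of table entries) fits in a short sketch. The class name LearningBridge, the five-minute max_age, and the use of a Python dictionary for the hash table are illustrative assumptions; real bridges do this work in hardware.

```python
import time


class LearningBridge:
    """A transparent bridge sketch: learns from source addresses, ages entries."""

    def __init__(self, lans, max_age=300.0, clock=time.monotonic):
        self.lans = set(lans)          # LANs this bridge is attached to
        self.max_age = max_age         # seconds before an entry is purged (assumed)
        self.clock = clock
        self.table = {}                # address -> (LAN, time last seen)

    def receive(self, src, dst, arrival_lan):
        """Return the LANs on which to output the frame (empty list = discard)."""
        # Backward learning: the source must be reachable via the arrival LAN.
        self.table[src] = (arrival_lan, self.clock())
        entry = self.table.get(dst)
        if entry is None:
            return sorted(self.lans - {arrival_lan})   # rule 3: flood
        dest_lan, _ = entry
        if dest_lan == arrival_lan:
            return []                                  # rule 1: discard
        return [dest_lan]                              # rule 2: forward

    def purge(self):
        """Drop entries not refreshed within max_age seconds."""
        now = self.clock()
        stale = [a for a, (_, seen) in self.table.items() if now - seen > self.max_age]
        for address in stale:
            del self.table[address]


if __name__ == "__main__":
    b1 = LearningBridge(["LAN 1", "LAN 2"])
    print(b1.receive("A", "C", "LAN 1"))   # C unknown: flooded onto LAN 2
    print(b1.receive("C", "A", "LAN 2"))   # A was learned: forwarded to LAN 1
    print(b1.receive("A", "C", "LAN 1"))   # C now known: forwarded, not flooded
```

Calling purge periodically gives the behavior described above: a machine that moves, or stays quiet for a few minutes, is simply forgotten and relearned from its next frame.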


4.5.3 Spanning Tree Bridges 


To increase reliability, some sites use two or more bridges in parallel between pairs of LANs, as 
shown in Fig. 4-43. This arrangement, however, also introduces some additional problems 
because it creates loops in the topology. 


Figure 4-43. Two parallel transparent bridges. 


A simple example of these problems can be seen by observing how a frame, F, with unknown 
destination is handled in Fig. 4-43. Each bridge, following the normal rules for handling 
unknown destinations, uses flooding, which in this example just means copying it to LAN 2. 
Shortly thereafter, bridge 1 sees F2, a frame with an unknown destination, which it copies to 
LAN 1, generating F3 (not shown). Similarly, bridge 2 copies F1 to LAN 1 generating F4 (also 
not shown). Bridge 1 now forwards F4 and bridge 2 copies F3. This cycle goes on forever. 


The solution to this difficulty is for the bridges to communicate with each other and overlay the 
actual topology with a spanning tree that reaches every LAN. In effect, some potential 
connections between LANs are ignored in the interest of constructing a fictitious loop-free 
topology. For example, in Fig. 4-44(a) we see nine LANs interconnected by ten bridges. This 
configuration can be abstracted into a graph with the LANs as the nodes. An arc connects any 
two LANs that are connected by a bridge. The graph can be reduced to a spanning tree by 
dropping the arcs shown as dotted lines in Fig. 4-44(b). Using this spanning tree, there is 
exactly one path from every LAN to every other LAN. Once the bridges have agreed on the 
spanning tree, all forwarding between LANs follows the spanning tree. Since there is a unique 
path from each source to each destination, loops are impossible. 


Figure 4-44. (a) Interconnected LANs. (b) A spanning tree covering 
the LANs. The dotted lines are not part of the spanning tree. 


To build the spanning tree, first the bridges have to choose one bridge to be the root of the 
tree. They make this choice by having each one broadcast its serial number, installed by the 
manufacturer and guaranteed to be unique worldwide. The bridge with the lowest serial 
number becomes the root. Next, a tree of shortest paths from the root to every bridge and 
LAN is constructed. This tree is the spanning tree. If a bridge or LAN fails, a new one is 
computed. 


The result of this algorithm is that a unique path is established from every LAN to the root and 
thus to every other LAN. Although the tree spans all the LANs, not all the bridges are 
necessarily present in the tree (to prevent loops). Even after the spanning tree has been 
established, the algorithm continues to run during normal operation in order to automatically 
detect topology changes and update the tree. The distributed algorithm used for constructing 
the spanning tree was invented by Radia Perlman and is described in detail in (Perlman, 2000). 
It is standardized in IEEE 802.1D. 
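

The outcome of the algorithm (though not the distributed protocol itself) can be illustrated with a centralized sketch: pick the bridge with the lowest serial number as root, then grow a tree of shortest paths outward by breadth-first search. The function name spanning_tree, the input format, and the tie-breaking by sorted order are our assumptions. The real algorithm reaches the same kind of result by exchanging messages among the bridges.

```python
from collections import deque


def spanning_tree(bridge_lans):
    """Compute a loop-free set of bridge-to-LAN attachments.

    bridge_lans maps each bridge's serial number to the LANs it connects.
    Returns (root, kept), where kept is the set of (bridge, lan) attachments
    that remain active. Centralized illustration only, not the 802.1D protocol.
    """
    root = min(bridge_lans)                        # lowest serial number wins
    lan_bridges = {}
    for bridge, lans in bridge_lans.items():
        for lan in lans:
            lan_bridges.setdefault(lan, []).append(bridge)

    kept = set()
    seen_bridges, seen_lans = {root}, set()
    queue = deque([("bridge", root)])
    while queue:                                   # breadth-first: shortest paths
        kind, node = queue.popleft()
        if kind == "bridge":
            for lan in sorted(bridge_lans[node]):
                if lan not in seen_lans:
                    seen_lans.add(lan)
                    kept.add((node, lan))
                    queue.append(("lan", lan))
        else:
            for bridge in sorted(lan_bridges[node]):
                if bridge not in seen_bridges:
                    seen_bridges.add(bridge)
                    kept.add((bridge, node))
                    queue.append(("bridge", bridge))
    return root, kept


if __name__ == "__main__":
    # Two parallel bridges between the same pair of LANs, as in Fig. 4-43.
    root, kept = spanning_tree({1: ["LAN 1", "LAN 2"], 2: ["LAN 1", "LAN 2"]})
    print(root, sorted(kept))
```

In the two-bridge example, bridge 2 keeps only one active attachment, so it has nowhere to forward frames and the loop of Fig. 4-43 disappears. This is the sense in which a bridge can be absent from the tree.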


4.5.4 Remote Bridges 


A common use of bridges is to connect two (or more) distant LANs. For example, a company 
might have plants in several cities, each with its own LAN. Ideally, all the LANs should be 
interconnected, so the complete system acts like one large LAN. 


This goal can be achieved by putting a bridge on each LAN and connecting the bridges pairwise 
with point-to-point lines (e.g., lines leased from a telephone company). A simple system, with 
three LANs, is illustrated in Fig. 4-45. The usual routing algorithms apply here. The simplest 
way to see this is to regard the three point-to-point lines as hostless LANs. Then we have a 
normal system of six LANs interconnected by four bridges. Nothing in what we have studied so 
far says that a LAN must have hosts on it. 


Figure 4-45. Remote bridges can be used to interconnect distant LANs. 


Various protocols can be used on the point-to-point lines. One possibility is to choose some 
standard point-to-point data link protocol such as PPP, putting complete MAC frames in the 
payload field. This strategy works best if all the LANs are identical, and the only problem is 
getting frames to the correct LAN. Another option is to strip off the MAC header and trailer at 
the source bridge and put what is left in the payload field of the point-to-point protocol. A new 
MAC header and trailer can then be generated at the destination bridge. A disadvantage of this 
approach is that the checksum that arrives at the destination host is not the one computed by 
the source host, so errors caused by bad bits in a bridge's memory may not be detected. 
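

The checksum disadvantage can be made concrete with a small experiment. The sketch below is an illustration only: it uses Python's zlib.crc32 as a stand-in for the frame checksum, omits the MAC header, and models a bad bit in the bridge's memory with a hypothetical corrupt helper that flips a single bit.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer (the header is omitted for brevity)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def frame_ok(frame: bytes) -> bool:
    """Recompute the checksum over the payload and compare it with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer

def corrupt(data: bytes) -> bytes:
    """Flip one bit, standing in for a bad bit in the bridge's memory."""
    return bytes([data[0] ^ 0x01]) + data[1:]

original = make_frame(b"payroll data")

# Option 1: the complete frame is carried, so the source's checksum travels end to end.
carried_whole = corrupt(original)

# Option 2: the trailer is stripped and a new one is generated at the destination bridge.
regenerated = make_frame(corrupt(original[:-4]))
```

Here frame_ok(carried_whole) is false, so the destination host notices the damage, whereas frame_ok(regenerated) is true even though the payload is no longer what the source sent.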


4.5.5 Repeaters, Hubs, Bridges, Switches, Routers, and Gateways 


So far in this book we have looked at a variety of ways to get frames and packets from one 
cable segment to another. We have mentioned repeaters, bridges, switches, hubs, routers, and 
gateways. All of these devices are in common use, but they all differ in subtle and not-so- 
subtle ways. Since there are so many of them, it is probably worth taking a look at them 
together to see what the similarities and differences are. 


To start with, these devices operate in different layers, as illustrated in Fig. 4-46(a). The layer 
matters because different devices use different pieces of information to decide how to switch. 
In a typical scenario, the user generates some data to be sent to a remote machine. Those 
data are passed to the transport layer, which then adds a header, for example, a TCP header, 
and passes the resulting unit down to the network layer. The network layer adds its own 
header to form a network layer packet, for example, an IP packet. In Fig. 4-46(b) we see the 
IP packet shaded in gray. Then the packet goes to the data link layer, which adds its own 
header and checksum (CRC) and gives the resulting frame to the physical layer for 
transmission, for example, over a LAN. 
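

The nesting of headers described here can be mimicked with a toy example. The byte strings [TCP], [IP], and [MAC] below are placeholders rather than real header layouts, and zlib.crc32 merely stands in for the frame checksum; the function names are made up for the illustration.

```python
import zlib

def encapsulate(user_data: bytes) -> bytes:
    segment = b"[TCP]" + user_data                      # transport layer adds its header
    packet = b"[IP]" + segment                          # network layer adds its header
    body = b"[MAC]" + packet                            # data link layer adds a header...
    return body + zlib.crc32(body).to_bytes(4, "big")   # ...and a checksum trailer

def network_layer_view(frame: bytes) -> bytes:
    """Strip the frame header and trailer, leaving only the packet."""
    return frame[len(b"[MAC]"):-4]

frame = encapsulate(b"hello")
packet = network_layer_view(frame)   # b"[IP][TCP]hello"
```

A device working in the data link layer looks only at the outermost header, while one working in the network layer sees what network_layer_view returns.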


Figure 4-46. (a) Which device is in which layer. (b) Frames, packets, 
and headers. 


Now let us look at the switching devices and see how they relate to the packets and frames. At 
the bottom, in the physical layer, we find the repeaters. These are analog devices that are 
connected to two cable segments. A signal appearing on one of them is amplified and put out 
on the other. Repeaters do not understand frames, packets, or headers. They understand 
volts. Classic Ethernet, for example, was designed to allow four repeaters, in order to extend 
the maximum cable length from 500 meters to 2500 meters. 


Next we come to the hubs. A hub has a number of input lines that it joins electrically. Frames 
arriving on any of the lines are sent out on all the others. If two frames arrive at the same 
time, they will collide, just as on a coaxial cable. In other words, the entire hub forms a single 
collision domain. All the lines coming into a hub must operate at the same speed. Hubs differ 
from repeaters in that they do not (usually) amplify the incoming signals and are designed to 
hold multiple line cards each with multiple inputs, but the differences are slight. Like repeaters, 
hubs do not examine the 802 addresses or use them in any way. A hub is shown in Fig. 4-47(a). 


Figure 4-47. (a) A hub. (b) A bridge. (c) A switch. 


Now let us move up to the data link layer where we find bridges and switches. We just studied 
bridges at some length. A bridge connects two or more LANs, as shown in Fig. 4-47(b). When 
a frame arrives, software in the bridge extracts the destination address from the frame header 
and looks it up in a table to see where to send the frame. For Ethernet, this address is the 
48-bit destination address shown in Fig. 4-17. Like a hub, a modern bridge has line cards, usually 
for four or eight input lines of a certain type. A line card for Ethernet cannot handle, say, token 
ring frames, because it does not know where to find the destination address in the frame 
header. However, a bridge may have line cards for different network types and different 
speeds. With a bridge, each line is its own collision domain, in contrast to a hub. 


Switches are similar to bridges in that both route on frame addresses. In fact, many people 
use the terms interchangeably. The main difference is that a switch is most often used to 
connect individual computers, as shown in Fig. 4-47(c). As a consequence, when host A in Fig. 
4-47(b) wants to send a frame to host B, the bridge gets the frame but just discards it. In 
contrast, in Fig. 4-47(c), the switch must actively forward the frame from A to B because there 
is no other way for the frame to get there. Since each switch port usually goes to a single 
computer, switches must have space for many more line cards than do bridges intended to 
connect only LANs. Each line card provides buffer space for frames arriving on its ports. Since 
each port is its own collision domain, switches never lose frames to collisions. However, if 
frames come in faster than they can be retransmitted, the switch may run out of buffer space 
and have to start discarding frames. 
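

The difference between discarding and forwarding falls out of the same table lookup. The sketch below assumes a bridge that learns the location of each sender from the frames it sees; the class name, the port numbers, and the single-letter addresses are hypothetical, and buffering is not modeled.

```python
class Bridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}                       # address -> port it was last seen on

    def handle(self, src, dst, in_port):
        """Return the set of ports on which the frame is sent out."""
        self.table[src] = in_port             # learn where the sender lives
        out_port = self.table.get(dst)
        if out_port is None:
            return self.ports - {in_port}     # unknown destination: flood
        if out_port == in_port:
            return set()                      # destination is on the arrival LAN: discard
        return {out_port}                     # forward on exactly one port

# A and B share the LAN on port 1, as with a bridge between LANs.
shared = Bridge(ports=[1, 2])
shared.handle("B", "A", 1)                    # the bridge learns that B is on port 1
discarded = shared.handle("A", "B", 1)        # set(): the frame is just discarded

# A and B each have their own port, as with a switch.
switch = Bridge(ports=[1, 2, 3])
switch.handle("B", "A", 2)                    # the switch learns that B is on port 2
forwarded = switch.handle("A", "B", 1)        # {2}: the frame must be forwarded
```

The code path is identical in both cases; only the wiring differs, which is one way to see why the two devices are so often treated as the same thing.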


To alleviate this problem slightly, modern switches start forwarding frames as soon as the 
destination header field has come in, but before the rest of the frame has arrived (provided the 
output line is available, of course). These switches do not use store-and-forward switching. 
Sometimes they are referred to as cut-through switches. Usually, cut-through is handled 
entirely in hardware, whereas bridges traditionally contained an actual CPU that did 
store-and-forward switching in software. But since all modern bridges and switches contain special 
integrated circuits for switching, the difference between a switch and bridge is more a 
marketing issue than a technical one. 
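

To get a feeling for what cut-through switching saves, we can compare how long a switch must wait before it can begin retransmitting. The figures below are assumptions chosen for illustration: a maximum-size 1518-byte Ethernet frame, a 100-Mbps line, and a forwarding decision made as soon as the 6-byte destination address has arrived. The preamble and any lookup time are ignored.

```python
def receive_time_us(n_bytes: int, rate_mbps: float) -> float:
    """Microseconds needed to receive n_bytes at rate_mbps (1 Mbps is 1 bit per microsecond)."""
    return n_bytes * 8 / rate_mbps

FRAME_BYTES = 1518        # assumed maximum-size untagged Ethernet frame
DEST_ADDR_BYTES = 6       # the 48-bit destination address
RATE_MBPS = 100           # assumed line rate

store_and_forward = receive_time_us(FRAME_BYTES, RATE_MBPS)   # whole frame must arrive
cut_through = receive_time_us(DEST_ADDR_BYTES, RATE_MBPS)     # only the address must arrive
```

Under these assumptions a store-and-forward device waits roughly 121 microseconds per hop, while a cut-through device can start after about half a microsecond.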


So far we have seen repeaters and hubs, which are quite similar, as well as bridges and 
switches, which are also very similar to each other. Now we move up to routers, which are 
different from all of the above. When a packet comes into a router, the frame header and 
trailer are stripped off and the packet located in the frame's payload field (shaded in Fig. 4-46) 
is passed to the routing software. This software uses the packet header to choose an output 
line. For an IP packet, the packet header will contain a 32-bit (IPv4) or 128-bit (IPv6) address, 
but not a 48-bit 802 address. The routing software does not see the frame addresses and does 
not even know whether the packet came in on a LAN or a point-to-point line. We will study 
routers and routing in Chap. 5. 


Up another layer we find transport gateways. These connect two computers that use different 
connection-oriented transport protocols. For example, suppose a computer using the 
connection-oriented TCP/IP protocol needs to talk to a computer using the connection-oriented 
ATM transport protocol. The transport gateway can copy the packets from one connection to 
the other, reformatting them as need be. 


Finally, application gateways understand the format and contents of the data and translate 
messages from one format to another. An e-mail gateway could translate Internet messages 
into SMS messages for mobile phones, for example. 


4.5.6 Virtual LANs 


In the early days of local area networking, thick yellow cables snaked through the cable ducts 
of many office buildings. Every computer they passed was plugged in. Often there were many 
cables, which were connected to a central backbone (as in Fig. 4-39) or to a central hub. No 
thought was given to which computer belonged on which LAN. All the people in adjacent offices 
were put on the same LAN whether they belonged together or not. Geography trumped logic. 


With the advent of 10Base-T and hubs in the 1990s, all that changed. Buildings were rewired 
(at considerable expense) to rip out all the yellow garden hoses and install twisted pairs from 
every office to central wiring closets at the end of each corridor or in a central machine room, 
as illustrated in Fig. 4-48. If the Vice President in Charge of Wiring was a visionary, category 5 
twisted pairs were installed; if he was a bean counter, the existing (category 3) telephone 
wiring was used (only to be replaced a few years later when fast Ethernet emerged). 


Figure 4-48. A building with centralized wiring using hubs and a 
switch. 


With hubbed (and later, switched) Ethernet, it was often possible to configure LANs logically 
rather than physically. If a company wants k LANs, it buys k hubs. By carefully choosing which 
connectors to plug into which hubs, the occupants of a LAN can be chosen in a way that makes 
organizational sense, without too much regard to geography. Of course, if two people in the 
same department work in different buildings, they are probably going to be on different hubs 
and thus different LANs. Nevertheless, the situation is a lot better than having LAN 
membership entirely based on geography. 


Does it matter who is on which LAN? After all, in virtually all organizations, all the LANs are 
interconnected. In short, yes, it often matters. Network administrators like to group users on 
LANs to reflect the organizational structure rather than the physical layout of the building for a 
variety of reasons. One issue is security. Any network interface can be put in promiscuous 
mode, copying all the traffic that comes down the pipe. Many departments, such as research, 
patents, and accounting, have information that they do not want passed outside their 
department. In such a situation, putting all the people in a department on a single LAN and not 
letting any of that traffic off the LAN makes sense. Management does not like hearing that 
such an arrangement is impossible unless all the people in each department are located in 
adjacent offices with no interlopers. 


A second issue is load. Some LANs are more heavily used than others and it may be desirable 
to separate them at times. For example, if the folks in research are running all kinds of nifty 
experiments that sometimes get out of hand and saturate their LAN, the folks in accounting 
may not be enthusiastic about donating some of their capacity to help out. 


A third issue is broadcasting. Most LANs support broadcasting, and many upper-layer protocols 
use this feature extensively. For example, when a user wants to send a packet to an IP 
address x, how does it know which MAC address to put in the frame? We will study this 
question in Chap. 5, but briefly summarized, the answer is that it broadcasts a frame 
containing the question: Who owns IP address x? Then it waits for an answer. And there are 
many more examples of where broadcasting is used. As more and more LANs get 
interconnected, the number of broadcasts passing each machine tends to increase linearly with 
the number of machines. 


Related to broadcasts is the problem that once in a while a network interface will break down 
and begin generating an endless stream of broadcast frames. The result of this broadcast 
storm is that (1) the entire LAN capacity is occupied by these frames, and (2) all the machines 
on all the interconnected LANs are crippled just processing and discarding all the frames being 
broadcast. 


At first it might appear that broadcast storms could be limited in scope by separating the LANs 
with bridges or switches, but if the goal is to achieve transparency (i.e., a machine can be 
moved to a different LAN across the bridge without anyone noticing it), then bridges have to 
forward broadcast frames. 


Having seen why companies might want multiple LANs with restricted scope, let us get back to 
the problem of decoupling the logical topology from the physical topology. Suppose that a user 
gets shifted within the company from one department to another without changing offices or 
changes offices without changing departments. With hubbed wiring, moving the user to the 
correct LAN means having the network administrator walk down to the wiring closet and pull 
the connector for the user's machine from one hub and put it into a new hub. 


In many companies, organizational changes occur all the time, meaning that system 
administrators spend a lot of time pulling out plugs and pushing them back in somewhere else. 
Also, in some cases, the change cannot be made at all because the twisted pair from the user's 
machine is too far from the correct hub (e.g., in the wrong building). 


In response to user requests for more flexibility, network vendors began working on a way to 
rewire buildings entirely in software. The resulting concept is called a VLAN (Virtual LAN) and 
has even been standardized by the 802 committee. It is now being deployed in many 
organizations. Let us now take a look at it. For additional information about VLANs, see 
(Breyer and Riley, 1999; and Seifert, 2000). 


VLANs are based on specially designed VLAN-aware switches, although they may also have 
some hubs on the periphery, as in Fig. 4-48. To set up a VLAN-based network, the network 
administrator decides how many VLANs there will be, which computers will be on which VLAN, 
and what the VLANs will be called. Often the VLANs are (informally) named by colors, since it 
is then possible to print color diagrams showing the physical layout of the machines, with the 
members of the red LAN in red, members of the green LAN in green, and so on. In this way, 
both the physical and logical layouts are visible in a single view. 


As an example, consider the four LANs of Fig. 4-49(a), in which eight of the machines belong 
to the G (gray) VLAN and seven of them belong to the W (white) VLAN. The four physical LANs 
are connected by two bridges, B1 and B2. If centralized twisted pair wiring is used, there 
might also be four hubs (not shown), but logically a multidrop cable and a hub are the same 
thing. Drawing it this way just makes the figure a little less cluttered. Also, the term "bridge" 
tends to be used nowadays mostly when there are multiple machines on each port, as in this 
figure, but otherwise, "bridge" and "switch" are essentially interchangeable. Fig. 4-49(b) 
shows the same machines and same VLANs using switches with a single computer on each 
port. 


Figure 4-49. (a) Four physical LANs organized into two VLANs, gray 
and white, by two bridges. (b) The same 15 machines organized into 
two VLANs by switches. 




To make the VLANs function correctly, configuration tables have to be set up in the bridges or 
switches. These tables tell which VLANs are accessible via which ports (lines). When a frame 
comes in from, say, the gray VLAN, it must be forwarded on all the ports marked G. This holds 
for ordinary (i.e., unicast) traffic as well as for multicast and broadcast traffic. 


Note that a port may be labeled with multiple VLAN colors. We see this most clearly in Fig. 
4-49(a). Suppose that machine A broadcasts a frame. Bridge B1 receives the frame and sees 
that it came from a machine on the gray VLAN, so it forwards it on all ports labeled G (except 
the incoming port). Since B1 has only two other ports and both of them are labeled G, the 
frame is sent to both of them. 


At B2 the story is different. Here the bridge knows that there are no gray machines on LAN 4, 
so the frame is not forwarded there. It goes only to LAN 2. If one of the users on LAN 4 should 
change departments and be moved to the gray VLAN, then the tables inside B2 have to be 
updated to relabel that port as GW instead of W. If machine F goes gray, then the port to LAN 
2 has to be changed to G instead of GW. 


Now let us imagine that all the machines on both LAN 2 and LAN 4 become gray. Then not only 
do B2's ports to LAN 2 and LAN 4 get marked G, but B1's port to B2 also has to change from 
GW to G since white frames arriving at B1 from LANs 1 and 3 no longer have to be forwarded 
to B2. In Fig. 4-49(b) the same situation holds, only here all the ports that go to a single 
machine are labeled with a single color because only one VLAN is out there. 
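

The forwarding rule can be written down compactly. In the sketch below the port labels are loosely modeled on the description of B1 and B2 above; the port names are made up, and placing machine A on LAN 1 and labeling B1's port to LAN 3 as G are assumptions, since the figure itself is not reproduced here.

```python
def forward_ports(port_labels, frame_vlan, in_port):
    """Ports on which a frame of the given VLAN color must be sent."""
    return {
        port
        for port, colors in port_labels.items()
        if frame_vlan in colors and port != in_port
    }

b1 = {"LAN1": {"G", "W"}, "LAN3": {"G"}, "toB2": {"G", "W"}}
b2 = {"toB1": {"G", "W"}, "LAN2": {"G", "W"}, "LAN4": {"W"}}

at_b1 = forward_ports(b1, "G", in_port="LAN1")      # {"LAN3", "toB2"}
at_b2 = forward_ports(b2, "G", in_port="toB1")      # {"LAN2"}: no gray machines on LAN 4

# A user on LAN 4 moves to the gray VLAN: that port is relabeled GW instead of W.
b2["LAN4"] = {"G", "W"}
after_move = forward_ports(b2, "G", in_port="toB1")  # {"LAN2", "LAN4"}
```

Changing a machine's VLAN thus amounts to editing a table entry rather than moving a plug in the wiring closet.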


So far we have assumed that bridges and switches somehow know what color an incoming 
frame is. How do they know this? Three methods are in use, as follows: 


1. Every port is assigned a VLAN color. 
2. Every MAC address is assigned a VLAN color. 
3. Every layer 3 protocol or IP address is assigned a VLAN color. 


In the first method, each port is labeled with a VLAN color. However, this method only works if 
all machines on a port belong to the same VLAN. In Fig. 4-49(a), this property holds for B1 for 
the port to LAN 3 but not for the port to LAN 1. 


In the second method, the bridge or switch has a table listing the 48-bit MAC address of each 
machine connected to it along with the VLAN that machine is on. Under these conditions, it is 
possible to mix VLANs on a physical LAN, as in LAN 1 in Fig. 4-49(a). When a frame arrives, all 
the bridge or switch has to do is to extract the MAC address and look it up in a table to see 
which VLAN the frame came from. 


The third method is for the bridge or switch to examine the payload field of the frame, for 
example, to classify all IP machines as belonging to one VLAN and all AppleTalk machines as 
belonging to another. For the former, the IP address can also be used to identify the machine. 


This strategy is most useful when many machines are notebook computers that can be docked 
in any one of several places. Since each docking station has its own MAC address, just knowing 
which docking station was used does not say anything about which VLAN the notebook is on. 
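

The three methods differ only in which piece of information is used as the lookup key. The sketch below represents a frame as a plain dictionary with hypothetical fields in_port, src_mac, and l3_protocol; a real switch would extract these from the arriving frame, and the table contents are invented for the example.

```python
def classify(frame, port_vlan=None, mac_vlan=None, protocol_vlan=None):
    """Return the VLAN color of an incoming frame, or None if it is unknown."""
    if port_vlan is not None:                  # method 1: by arrival port
        return port_vlan.get(frame["in_port"])
    if mac_vlan is not None:                   # method 2: by source MAC address
        return mac_vlan.get(frame["src_mac"])
    if protocol_vlan is not None:              # method 3: by layer 3 protocol (peeks at the payload)
        return protocol_vlan.get(frame["l3_protocol"])
    return None

frame = {"in_port": 3, "src_mac": "00:11:22:33:44:55", "l3_protocol": "IP"}

by_port = classify(frame, port_vlan={3: "G"})
by_mac = classify(frame, mac_vlan={"00:11:22:33:44:55": "W"})
by_protocol = classify(frame, protocol_vlan={"IP": "G", "AppleTalk": "W"})
```

Only the third lookup needs anything from beyond the frame header.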


The only problem with this approach is that it violates the most fundamental rule of 
networking: independence of the layers. It is none of the data link layer's business what is in 
the payload field. It should not be examining the payload and certainly not be making 
decisions based on the contents. A consequence of using this approach is that a change to the 
layer 3 protocol (for example, an upgrade from IPv4 to IPv6) suddenly causes the switches to 
fail. Unfortunately, switches that work this way are on the market. 


Of course, there is nothing wrong with routing based on IP addresses— nearly all of Chap. 5 is 
devoted to IP routing—but mixing the layers is looking for trouble. A switch vendor might 
pooh-pooh this argument saying that its switches understand both IPv4 and IPv6, so 
everything is fine. But what happens when IPv7 happens? The vendor would probably say: Buy 
new switches, is that so bad? 


The IEEE 802.1Q Standard 


Some more thought on this subject reveals that what actually matters is the VLAN of the frame 
itself, not the VLAN of the sending machine. If there were some way to identify the VLAN in 
the frame header, then the need to inspect the payload would vanish. For a new LAN, such as 
802.11 or 802.16, it would have been easy enough to just add a VLAN field in the header. In 
fact, the Connection Identifier field in 802.16 is somewhat similar in spirit to a VLAN identifier. 
But what to do about Ethernet, which is the dominant LAN, and does not have any spare fields 
lying around for the VLAN identifier? 


The IEEE 802 committee had this problem thrown into its lap in 1995. After much discussion, it 
did the unthinkable and changed the Ethernet header. The new format was published in IEEE 
standard 802.1Q, issued in 1998. The new format contains a VLAN tag; we will examine it 
shortly. Not surprisingly, changing something as well established as the Ethernet header is not 
entirely trivial. A few questions that come to mind are: 


1. Need we throw out several hundred million existing Ethernet cards? 
2. If not, who generates the new fields? 
3. What happens to frames that are already the maximum size? 


Of course, the 802 committee was (only too painfully) aware of these problems and had to 
come up with solutions, which it did. 


The key to the solution is to realize that the VLAN fields are only actually used by the bridges 
and switches and not by the user machines. Thus in Fig. 4-49, it is not really essential that 
they are present on the lines going out to the end stations as long as they are on the line 
between the bridges or switches. Thus, to use VLANs, the bridges or switches have to be VLAN 
aware, but that was already a requirement. Now we are only introducing the additional 
requirement that they are 802.1Q aware, which new ones already are. 


As to throwing out all existing Ethernet cards, the answer is no. Remember that the 802.3 
committee could not even get people to change the Type field into a Length field. You can 
imagine the reaction to an announcement that all existing Ethernet cards had to be thrown 
out. However, as new Ethernet cards come on the market, the hope is that they will be 802.1Q 
compliant and correctly fill in the VLAN fields. 


So if the originator does not generate the VLAN fields, who does? The answer is that the first 
VLAN-aware bridge or switch to touch a frame adds them and the last one down the road 
removes them. But how does it know which frame belongs to which VLAN? Well, the first 
bridge or switch could assign a VLAN number to a port, look at the MAC address, or (heaven 
forbid) examine the payload. Until Ethernet cards are all 802.1Q compliant, we are kind of 
back where we started. The real hope here is that all gigabit Ethernet cards will be 802.1Q 
compliant from the start and that as people upgrade to gigabit Ethernet, 802.1Q will be 
introduced automatically. As to the problem of frames longer than 1518 bytes, 802.1Q just 
raised the limit to 1522 bytes. 


During the transition process, many installations will have some legacy machines (typically 
classic or fast Ethernet) that are not VLAN aware and others (typically gigabit Ethernet) that 
are. This situation is illustrated in Fig. 4-50, where the shaded symbols are VLAN aware and 
the empty ones are not. For simplicity, we assume that all the switches are VLAN aware. If this 
is not the case, the first VLAN-aware switch can add the tags based on MAC or IP addresses. 


Figure 4-50. Transition from legacy Ethernet to VLAN-aware Ethernet. 
The shaded symbols are VLAN aware. The empty ones are not. 


In this figure, VLAN-aware Ethernet cards generate tagged (i.e., 802.1Q) frames directly, and 
further switching uses these tags. To do this switching, the switches have to know which 
VLANs are reachable on each port, just as before. Knowing that a frame belongs to the gray 
VLAN does not help much until the switch knows which ports connect to machines on the gray 
VLAN. Thus, the switch needs a table indexed by VLAN telling which ports to use and whether 
they are VLAN aware or legacy. 


When a legacy PC sends a frame to a VLAN-aware switch, the switch builds a new tagged 
frame based on its knowledge of the sender's VLAN (using the port, MAC address, or IP 
address). From that point on, it no longer matters that the sender was a legacy machine. 
Similarly, a switch that needs to deliver a tagged frame to a legacy machine has to reformat 
the frame in the legacy format before delivering it. 


Now let us take a look at the 802.1Q frame format. It is shown in Fig. 4-51. The only change is 
the addition of a pair of 2-byte fields. The first one is the VLAN protocol ID. It always has the 
value 0x8100. Since this number is greater than 1500, all Ethernet cards interpret it as a type 
rather than a length. What a legacy card does with such a frame is moot since such frames are 
not supposed to be sent to legacy cards. 


Figure 4-51. The 802.3 (legacy) and 802.1Q Ethernet frame formats. 


The second 2-byte field contains three subfields. The main one is the VLAN identifier, 
occupying the low-order 12 bits. This is what the whole thing is about—which VLAN does the 
frame belong to? The 3-bit Priority field has nothing to do with VLANs at all, but since changing 
the Ethernet header is a once-in-a-decade event taking three years and featuring a hundred 
people, why not put in some other good things while you are at it? This field makes it possible 
to distinguish hard real-time traffic from soft real-time traffic from time-insensitive traffic in 
order to provide better quality of service over Ethernet. It is needed for voice over Ethernet 
(although in all fairness, IP has had a similar field for a quarter of a century and nobody ever 
used it). 


The last bit, CFI (Canonical Format Indicator) should have been called the CEI (Corporate Ego 
Indicator). It was originally intended to indicate little-endian MAC addresses versus big-endian 
MAC addresses, but that use got lost in other controversies. Its presence now indicates that 
the payload contains a freeze-dried 802.5 frame that is hoping to find another 802.5 LAN at 
the destination while being carried by Ethernet in between. This whole arrangement, of course, 
has nothing whatsoever to do with VLANs. But standards' committee politics is not unlike 
regular politics: if you vote for my bit, I will vote for your bit. 
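

The two 2-byte fields are easy to build and take apart with Python's struct module. The sketch below assumes that frames are held as byte strings starting at the destination address, so that the tag sits at offset 12, after the two 6-byte addresses. It ignores the checksum, which would have to be recomputed whenever a tag is added or removed, and the function names are made up for the illustration.

```python
import struct

TPID = 0x8100   # the VLAN protocol ID

def add_tag(frame: bytes, vlan_id: int, priority: int = 0, cfi: int = 0) -> bytes:
    """Insert the two 2-byte 802.1Q fields after the destination and source addresses."""
    if not 0 <= vlan_id < 4096 or not 0 <= priority < 8 or cfi not in (0, 1):
        raise ValueError("field out of range")
    tci = (priority << 13) | (cfi << 12) | vlan_id   # 3 + 1 + 12 bits
    return frame[:12] + struct.pack("!HH", TPID, tci) + frame[12:]

def parse_tag(frame: bytes):
    """Return (priority, cfi, vlan_id), or None for an untagged (legacy) frame."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID:
        return None
    return tci >> 13, (tci >> 12) & 1, tci & 0x0FFF

def strip_tag(frame: bytes) -> bytes:
    """Put a tagged frame back into the legacy format."""
    return frame if parse_tag(frame) is None else frame[:12] + frame[16:]

# Hypothetical legacy frame: zeroed addresses, Type 0x0800, and a short payload.
legacy = bytes(6) + bytes(6) + b"\x08\x00" + b"payload"
tagged = add_tag(legacy, vlan_id=4, priority=5)
```

Calling parse_tag(tagged) gives (5, 0, 4), and strip_tag(tagged) returns the original legacy frame, four bytes shorter than the tagged one.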


As we mentioned above, when a tagged frame arrives at a VLAN-aware switch, the switch uses 
the VLAN ID as an index into a table to find out which ports to send it on. But where does the 
table come from? If it is manually constructed, we are back to square zero: manual 
configuration of bridges. The beauty of the transparent bridge is that it is plug-and-play and 
does not require any manual configuration. It would be a terrible shame to lose that property. 
Fortunately, VLAN-aware bridges can also autoconfigure themselves based on observing the 
tags that come by. If a frame tagged as VLAN 4 comes in on port 3, then apparently some 
machine on port 3 is on VLAN 4. The 802.1Q standard explains how to build the tables 
dynamically, mostly by referencing appropriate portions of Perlman's algorithm standardized in 
802.1D. 
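

The idea of learning the table from passing tags can be sketched as follows. This is an illustration of the observation in the text, not the procedure the standard specifies: entries never age out, and the class and method names are hypothetical.

```python
from collections import defaultdict

class VlanTable:
    def __init__(self):
        self.ports_for = defaultdict(set)      # VLAN ID -> ports where it has been seen

    def observe(self, vlan_id, in_port):
        """A tagged frame arriving on in_port implies a member of vlan_id is out there."""
        self.ports_for[vlan_id].add(in_port)

    def output_ports(self, vlan_id, in_port):
        """Ports to send a frame of vlan_id on, excluding the one it arrived on."""
        return self.ports_for.get(vlan_id, set()) - {in_port}

table = VlanTable()
table.observe(vlan_id=4, in_port=3)            # a frame tagged as VLAN 4 comes in on port 3
table.observe(vlan_id=4, in_port=1)
to_vlan4 = table.output_ports(vlan_id=4, in_port=1)   # {3}
unknown = table.output_ports(vlan_id=7, in_port=1)    # set(): nothing learned yet
```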


Before leaving the subject of VLAN routing, it is worth making one last observation. Many 
people in the Internet and Ethernet worlds are fanatically in favor of connectionless networking 
and violently opposed to anything smacking of connections in the data link or network layers. 
Yet VLANs introduce something that is surprisingly similar to a connection. To use VLANs 
properly, each frame carries a new special identifier that is used as an index into a table inside 
the switch to look up where the frame is supposed to be sent. That is precisely what happens 
in connection-oriented networks. In connectionless networks, it is the destination address that 
is used for routing, not some kind of connection identifier. We will see more of this creeping 
connectionism in Chap. 5. 


4.6 Summary 


Some networks have a single channel that is used for all communication. In these networks, 
the key design issue is the allocation of this channel among the competing stations wishing to 
use it. Numerous channel allocation algorithms have been devised. A summary of some of the 
more important channel allocation methods is given in Fig. 4-52. 


Figure 4-52. Channel allocation methods and systems for a common channel. 

Method             | Description 
FDM                | Dedicate a frequency band to each station 
WDM                | A dynamic FDM scheme for fiber 
TDM                | Dedicate a time slot to each station 
Pure ALOHA         | Unsynchronized transmission at any instant 
Slotted ALOHA      | Random transmission in well-defined time slots 
1-persistent CSMA  | Standard carrier sense multiple access 
Nonpersistent CSMA | Random delay when channel is sensed busy 
P-persistent CSMA  | CSMA, but with a probability of p of persisting 
CSMA/CD            | CSMA, but abort on detecting a collision 
Bit map            | Round-robin scheduling using a bit map 
Binary countdown   | Highest-numbered ready station goes next 
Tree walk          | Reduced contention by selective enabling 
MACA, MACAW        | Wireless LAN protocols 
Ethernet           | CSMA/CD with binary exponential backoff 
FHSS               | Frequency hopping spread spectrum 
DSSS               | Direct sequence spread spectrum 
CSMA/CA            | Carrier sense multiple access with collision avoidance 


The simplest allocation schemes are FDM and TDM. These are efficient when the number of 
stations is small and fixed and the traffic is continuous. Both are widely used under these 
circumstances, for example, for dividing up the bandwidth on telephone trunks. 


When the number of stations is large and variable or the traffic is fairly bursty, FDM and TDM 
are poor choices. The ALOHA protocol, with and without slotting, has been proposed as an 
alternative. ALOHA and its many variants and derivatives have been widely discussed, 
analyzed, and used in real systems. 


When the state of the channel can be sensed, stations can avoid starting a transmission while 
another station is transmitting. This technique, carrier sensing, has led to a variety of protocols 
that can be used on LANs and MANs. 


A class of protocols that eliminates contention altogether, or at least reduces it considerably, is 
well known. Binary countdown completely eliminates contention. The tree walk protocol 
reduces it by dynamically dividing the stations into two disjoint groups, one of which is 
permitted to transmit and one of which is not. It tries to make the division in such a way that 
only one station that is ready to send is permitted to do so. 


Wireless LANs have their own problems and solutions. The biggest problem is caused by 
hidden stations, so CSMA does not work. One class of solutions, typified by MACA and MACAW, 
attempts to stimulate transmissions around the destination, to make CSMA work better. 
Frequency hopping spread spectrum and direct sequence spread spectrum are also used. IEEE 
802.11 combines CSMA and MACAW to produce CSMA/CA. 


Ethernet is the dominant form of local area networking. It uses CSMA/CD for channel 
allocation. Older versions used a cable that snaked from machine to machine, but now twisted 
pairs to hubs and switches are most common. Speeds have risen from 10 Mbps to 1 Gbps and 
are still rising. 


Wireless LANs are becoming common, with 802.11 dominating the field. Its physical layer 
allows five different transmission modes, including infrared, various spread spectrum schemes, 
and a multichannel FDM system. It can operate with a base station in each cell, but it can also 
operate without one. The protocol is a variant of MACAW, with virtual carrier sensing. 


Wireless MANs are starting to appear. These are broadband systems that use radio to replace 
the last mile on telephone connections. Traditional narrowband modulation techniques are 
used. Quality of service is important, with the 802.16 standard defining four classes (constant 
bit rate, two variable bit rate, and one best effort). 


The Bluetooth system is also wireless but aimed more at the desktop, for connecting headsets 
and other peripherals to computers without wires. It is also intended to connect peripherals, 
such as fax machines, to mobile telephones. Like 802.11, it uses frequency hopping spread 
spectrum in the ISM band. Due to the expected noise level of many environments and need for 
real-time interaction, elaborate forward error correction is built into its various protocols. 


With so many different LANs, a way is needed to interconnect them all. Bridges and switches 
are used for this purpose. The spanning tree algorithm is used to build plug-and-play bridges. 
A new development in the LAN interconnection world is the VLAN, which separates the logical 
topology of the LANs from their physical topology. A new format for Ethernet frames (802.1Q) 
has been introduced to ease the introduction of VLANs into organizations. 


Chapter 5. The Network Layer


The network layer is concerned with getting packets from the source all the way to the 
destination. Getting to the destination may require making many hops at intermediate routers 
along the way. This function clearly contrasts with that of the data link layer, which has the 
more modest goal of just moving frames from one end of a wire to the other. Thus, the 
network layer is the lowest layer that deals with end-to-end transmission. 


To achieve its goals, the network layer must know about the topology of the communication 
subnet (i.e., the set of all routers) and choose appropriate paths through it. It must also take 
care to choose routes to avoid overloading some of the communication lines and routers while 
leaving others idle. Finally, when the source and destination are in different networks, new 
problems occur. It is up to the network layer to deal with them. In this chapter we will study 
all these issues and illustrate them, primarily using the Internet and its network layer protocol, 
IP, although wireless networks will also be addressed. 


5.1 Network Layer Design Issues 


In the following sections we will provide an introduction to some of the issues that the 
designers of the network layer must grapple with. These issues include the service provided to 
the transport layer and the internal design of the subnet. 


5.1.1 Store-and-Forward Packet Switching 


But before starting to explain the details of the network layer, it is probably worth restating 
the context in which the network layer protocols operate. This context can be seen in Fig. 5-1. 
The major components of the system are the carrier's equipment (routers connected by 
transmission lines), shown inside the shaded oval, and the customers' equipment, shown 
outside the oval. Host H1 is directly connected to one of the carrier's routers, A, by a leased 
line. In contrast, H2 is on a LAN with a router, F, owned and operated by the customer. This 
router also has a leased line to the carrier's equipment. We have shown F as being outside the 
oval because it does not belong to the carrier, but in terms of construction, software, and 
protocols, it is probably no different from the carrier's routers. Whether it belongs to the 
subnet is arguable, but for the purposes of this chapter, routers on customer premises are 
considered part of the subnet because they run the same algorithms as the carrier's routers 
(and our main concern here is algorithms). 


Figure 5-1. The environment of the network layer protocols.


This equipment is used as follows. A host with a packet to send transmits it to the nearest 
router, either on its own LAN or over a point-to-point link to the carrier. The packet is stored 
there until it has fully arrived so the checksum can be verified. Then it is forwarded to the next 
router along the path until it reaches the destination host, where it is delivered. This
mechanism is store-and-forward packet switching, as we have seen in previous chapters. 


5.1.2 Services Provided to the Transport Layer 


The network layer provides services to the transport layer at the network layer/transport layer 
interface. An important question is what kind of services the network layer provides to the 
transport layer. The network layer services have been designed with the following goals in 
mind. 


1. The services should be independent of the router technology. 

2. The transport layer should be shielded from the number, type, and topology of the 
routers present. 

3. The network addresses made available to the transport layer should use a uniform 
numbering plan, even across LANs and WANs. 


Given these goals, the designers of the network layer have a lot of freedom in writing detailed 
specifications of the services to be offered to the transport layer. This freedom often 
degenerates into a raging battle between two warring factions. The discussion centers on 
whether the network layer should provide connection-oriented service or connectionless 
service. 


One camp (represented by the Internet community) argues that the routers' job is moving 
packets around and nothing else. In their view (based on 30 years of actual experience with a 
real, working computer network), the subnet is inherently unreliable, no matter how it is 
designed. Therefore, the hosts should accept the fact that the network is unreliable and do 
error control (i.e., error detection and correction) and flow control themselves. 


This viewpoint leads quickly to the conclusion that the network service should be 
connectionless, with primitives SEND PACKET and RECEIVE PACKET and little else. In 
particular, no packet ordering and flow control should be done, because the hosts are going to 
do that anyway, and there is usually little to be gained by doing it twice. Furthermore, each 
packet must carry the full destination address, because each packet sent is carried 
independently of its predecessors, if any. 


The other camp (represented by the telephone companies) argues that the subnet should 
provide a reliable, connection-oriented service. They claim that 100 years of successful 
experience with the worldwide telephone system is an excellent guide. In this view, quality of 
service is the dominant factor, and without connections in the subnet, quality of service is very 
difficult to achieve, especially for real-time traffic such as voice and video. 


These two camps are best exemplified by the Internet and ATM. The Internet offers 
connectionless network-layer service; ATM networks offer connection-oriented network-layer 
service. However, it is interesting to note that as quality-of-service guarantees are becoming 
more and more important, the Internet is evolving. In particular, it is starting to acquire 
properties normally associated with connection-oriented service, as we will see later. Actually, 
we got an inkling of this evolution during our study of VLANs in Chap. 4. 


5.1.3 Implementation of Connectionless Service 


Having looked at the two classes of service the network layer can provide to its users, it is 
time to see how this layer works inside. Two different organizations are possible, depending on 
the type of service offered. If connectionless service is offered, packets are injected into the 
subnet individually and routed independently of each other. No advance setup is needed. In 
this context, the packets are frequently called datagrams (in analogy with telegrams) and the 
subnet is called a datagram subnet. If connection-oriented service is used, a path from the 
source router to the destination router must be established before any data packets can be
sent. This connection is called a VC (virtual circuit), in analogy with the physical circuits set 
up by the telephone system, and the subnet is called a virtual-circuit subnet. In this section 
we will examine datagram subnets; in the next one we will examine virtual-circuit subnets. 


Let us now see how a datagram subnet works. Suppose that the process P1 in Fig. 5-2 has a 
long message for P2. It hands the message to the transport layer with instructions to deliver it 
to process P2 on host H2. The transport layer code runs on H1, typically within the operating 
system. It prepends a transport header to the front of the message and hands the result to the 
network layer, probably just another procedure within the operating system. 


Figure 5-2. Routing within a datagram subnet. 


Let us assume that the message is four times longer than the maximum packet size, so the
network layer has to break it into four packets, 1, 2, 3, and 4, and send each of them in turn
to router A using some point-to-point protocol, for example, PPP. At this point the carrier takes 
over. Every router has an internal table telling it where to send packets for each possible 
destination. Each table entry is a pair consisting of a destination and the outgoing line to use 
for that destination. Only directly-connected lines can be used. For example, in Fig. 5-2, A has 
only two outgoing lines—to B and C—so every incoming packet must be sent to one of these 
routers, even if the ultimate destination is some other router. A's initial routing table is shown 
in the figure under the label "initially." 


As they arrived at A, packets 1, 2, and 3 were stored briefly (to verify their checksums). Then 
each was forwarded to C according to A's table. Packet 1 was then forwarded to E and then to 
F. When it got to F, it was encapsulated in a data link layer frame and sent to H2 over the LAN. 
Packets 2 and 3 follow the same route. 


However, something different happened to packet 4. When it got to A it was sent to router B, 
even though it is also destined for F. For some reason, A decided to send packet 4 via a 
different route than that of the first three. Perhaps it learned of a traffic jam somewhere along 
the ACE path and updated its routing table, as shown under the label "later." The algorithm 
that manages the tables and makes the routing decisions is called the routing algorithm. 
Routing algorithms are one of the main things we will study in this chapter. 
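
The table lookup that a datagram router performs for each packet can be sketched in C as follows. This is only an illustration under stated assumptions: the names route_entry and lookup_line are hypothetical, and of the table rows only the entry for F (first via C, later via B) is taken from the text; the other rows are assumed values, since the figure is not reproduced here.

```c
#include <stddef.h>

/* One routing table entry: a destination and the outgoing line (neighbor)
   to use for it. All names here are hypothetical, for illustration only. */
struct route_entry {
    char dest;      /* destination router, e.g., 'F' */
    char line;      /* directly connected neighbor to forward to */
};

/* Router A has only two outgoing lines, to B and to C. Only the F rows
   follow the text; the remaining rows are assumed values. */
static const struct route_entry a_initially[] = {
    {'B', 'B'}, {'C', 'C'}, {'D', 'B'}, {'E', 'C'}, {'F', 'C'}
};
static const struct route_entry a_later[] = {
    {'B', 'B'}, {'C', 'C'}, {'D', 'B'}, {'E', 'B'}, {'F', 'B'}
};

/* Return the outgoing line for dest, or 0 if the table has no entry. */
char lookup_line(const struct route_entry *table, size_t entries, char dest)
{
    size_t i;

    for (i = 0; i < entries; i++)
        if (table[i].dest == dest)
            return table[i].line;
    return 0;
}
```

With these tables, lookup_line(a_initially, 5, 'F') selects the line to C, as for packets 1, 2, and 3, while lookup_line(a_later, 5, 'F') selects the line to B, as for packet 4. The linear scan keeps the sketch short; a real router would use a faster lookup structure.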


5.1.4 Implementation of Connection-Oriented Service 


For connection-oriented service, we need a virtual-circuit subnet. Let us see how that works. 
The idea behind virtual circuits is to avoid having to choose a new route for every packet sent, 
as in Fig. 5-2. Instead, when a connection is established, a route from the source machine to 
the destination machine is chosen as part of the connection setup and stored in tables inside 
the routers. That route is used for all traffic flowing over the connection, exactly the same way 
that the telephone system works. When the connection is released, the virtual circuit is also 
terminated. With connection-oriented service, each packet carries an identifier telling which 
virtual circuit it belongs to. 


As an example, consider the situation of Fig. 5-3. Here, host H1 has established connection 1 
with host H2. It is remembered as the first entry in each of the routing tables. The first line of 
A's table says that if a packet bearing connection identifier 1 comes in from H1, it is to be sent 
to router C and given connection identifier 1. Similarly, the first entry at C routes the packet to 
E, also with connection identifier 1. 


Figure 5-3. Routing within a virtual-circuit subnet. 


Now let us consider what happens if H3 also wants to establish a connection to H2. It chooses
connection identifier 1 (because it is initiating the connection and this is its only connection) 
and tells the subnet to establish the virtual circuit. This leads to the second row in the tables. 
Note that we have a conflict here because although A can easily distinguish connection 1 
packets from H1 from connection 1 packets from H3, C cannot do this. For this reason, A 
assigns a different connection identifier to the outgoing traffic for the second connection. 
Avoiding conflicts of this kind is why routers need the ability to replace connection identifiers in 
outgoing packets. In some contexts, this is called label switching. 
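
The identifier replacement described above can be sketched as a small table lookup. The code is a minimal illustration, not the layout of any real router: the names vc_entry and switch_label are hypothetical, and the outgoing identifier 2 for the second connection is an assumed value (the text says only that A assigns a different identifier).

```c
#include <stddef.h>
#include <string.h>

/* One row of a virtual-circuit table: (incoming line, incoming identifier)
   maps to (outgoing line, outgoing identifier). Hypothetical names. */
struct vc_entry {
    const char *in_line;    /* where the packet came from */
    int in_vc;              /* connection identifier on arrival */
    const char *out_line;   /* where to send it next */
    int out_vc;             /* identifier written into the outgoing packet */
};

/* Router A's table after both connections are set up. The value 2 in the
   second row is an assumption; any identifier unused on the line to C works. */
static const struct vc_entry a_table[] = {
    {"H1", 1, "C", 1},
    {"H3", 1, "C", 2}
};

/* Find the row matching (in_line, in_vc). On success, store the outgoing
   line in *out_line and return the outgoing identifier; otherwise return -1. */
int switch_label(const struct vc_entry *table, size_t rows,
                 const char *in_line, int in_vc, const char **out_line)
{
    size_t i;

    for (i = 0; i < rows; i++)
        if (table[i].in_vc == in_vc && strcmp(table[i].in_line, in_line) == 0) {
            *out_line = table[i].out_line;
            return table[i].out_vc;
        }
    return -1;
}
```

Both hosts use identifier 1 on their own lines, but the packets leave A toward C with distinct identifiers, so C can tell the two connections apart.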


5.1.5 Comparison of Virtual-Circuit and Datagram Subnets 


Both virtual circuits and datagrams have their supporters and their detractors. We will now 
attempt to summarize the arguments both ways. The major issues are listed in Fig. 5-4, 
although purists could probably find a counterexample for everything in the figure. 


Figure 5-4. Comparison of datagram and virtual-circuit subnets.


Issue                      Datagram subnet                  Virtual-circuit subnet

Circuit setup              Not needed                       Required

Addressing                 Each packet contains the full    Each packet contains a short
                           source and destination address   VC number

State information          Routers do not hold state        Each VC requires router table
                           information about connections    space per connection

Routing                    Each packet is routed            Route chosen when VC is set up;
                           independently                    all packets follow it

Effect of router failures  None, except for packets lost    All VCs that passed through the
                           during the crash                 failed router are terminated

Quality of service         Difficult                        Easy if enough resources can be
                                                            allocated in advance for each VC

Congestion control         Difficult                        Easy if enough resources can be
                                                            allocated in advance for each VC


Inside the subnet, several trade-offs exist between virtual circuits and datagrams. One
trade-off is between router memory space and bandwidth. Virtual circuits allow packets to contain
circuit numbers instead of full destination addresses. If the packets tend to be fairly short, a 
full destination address in every packet may represent a significant amount of overhead and 
hence, wasted bandwidth. The price paid for using virtual circuits internally is the table space 
within the routers. Depending upon the relative cost of communication circuits versus router 
memory, one or the other may be cheaper. 


Another trade-off is setup time versus address parsing time. Using virtual circuits requires a 
setup phase, which takes time and consumes resources. However, figuring out what to do with 
a data packet in a virtual-circuit subnet is easy: the router just uses the circuit number to 
index into a table to find out where the packet goes. In a datagram subnet, a more 
complicated lookup procedure is required to locate the entry for the destination. 


Yet another issue is the amount of table space required in router memory. A datagram subnet 
needs to have an entry for every possible destination, whereas a virtual-circuit subnet just 
needs an entry for each virtual circuit. However, this advantage is somewhat illusory since 
connection setup packets have to be routed too, and they use destination addresses, the same 
as datagrams do. 


Virtual circuits have some advantages in guaranteeing quality of service and avoiding 
congestion within the subnet because resources (e.g., buffers, bandwidth, and CPU cycles) can 
be reserved in advance, when the connection is established. Once the packets start arriving, 
the necessary bandwidth and router capacity will be there. With a datagram subnet, 
congestion avoidance is more difficult. 


For transaction processing systems (e.g., stores calling up to verify credit card purchases), the 
overhead required to set up and clear a virtual circuit may easily dwarf the use of the circuit. If 
the majority of the traffic is expected to be of this kind, the use of virtual circuits inside the 
subnet makes little sense. On the other hand, permanent virtual circuits, which are set up 
manually and last for months or years, may be useful here. 


Virtual circuits also have a vulnerability problem. If a router crashes and loses its memory, 
even if it comes back up a second later, all the virtual circuits passing through it will have to be 
aborted. In contrast, if a datagram router goes down, only those users whose packets were 
queued in the router at the time will suffer, and maybe not even all those, depending upon 
whether they have already been acknowledged. The loss of a communication line is fatal to
virtual circuits using it but can be easily compensated for if datagrams are used. Datagrams
also allow the routers to balance the traffic throughout the subnet, since routes can be 
changed partway through a long sequence of packet transmissions. 


5.2 Routing Algorithms 


The main function of the network layer is routing packets from the source machine to the 
destination machine. In most subnets, packets will require multiple hops to make the journey. 
The only notable exception is for broadcast networks, but even here routing is an issue if the 
source and destination are not on the same network. The algorithms that choose the routes 
and the data structures that they use are a major area of network layer design. 


The routing algorithm is that part of the network layer software responsible for deciding 
which output line an incoming packet should be transmitted on. If the subnet uses datagrams 
internally, this decision must be made anew for every arriving data packet since the best route 
may have changed since last time. If the subnet uses virtual circuits internally, routing 
decisions are made only when a new virtual circuit is being set up. Thereafter, data packets 
just follow the previously-established route. The latter case is sometimes called session 
routing because a route remains in force for an entire user session (e.g., a login session at a 
terminal or a file transfer). 


It is sometimes useful to make a distinction between routing, which is making the decision 
which routes to use, and forwarding, which is what happens when a packet arrives. One can 
think of a router as having two processes inside it. One of them handles each packet as it 
arrives, looking up the outgoing line to use for it in the routing tables. This process is 
forwarding. The other process is responsible for filling in and updating the routing tables. 
That is where the routing algorithm comes into play. 


Regardless of whether routes are chosen independently for each packet or only when new 
connections are established, certain properties are desirable in a routing algorithm: 
correctness, simplicity, robustness, stability, fairness, and optimality. Correctness and 
simplicity hardly require comment, but the need for robustness may be less obvious at first. 
Once a major network comes on the air, it may be expected to run continuously for years 
without systemwide failures. During that period there will be hardware and software failures of 
all kinds. Hosts, routers, and lines will fail repeatedly, and the topology will change many 
times. The routing algorithm should be able to cope with changes in the topology and traffic 
without requiring all jobs in all hosts to be aborted and the network to be rebooted every time 
some router crashes. 


Stability is also an important goal for the routing algorithm. There exist routing algorithms that 
never converge to equilibrium, no matter how long they run. A stable algorithm reaches 
equilibrium and stays there. Fairness and optimality may sound obvious (surely no reasonable
person would oppose them), but as it turns out, they are often contradictory goals. As a simple
example of this conflict, look at Fig. 5-5. Suppose that there is enough traffic between A and 
A', between B and B', and between C and C' to saturate the horizontal links. To maximize the 
total flow, the X to X' traffic should be shut off altogether. Unfortunately, X and X' may not see 
it that way. Evidently, some compromise between global efficiency and fairness to individual 
connections is needed. 


Figure 5-5. Conflict between fairness and optimality.


Before we can even attempt to find trade-offs between fairness and optimality, we must decide
what it is we seek to optimize. Minimizing mean packet delay is an obvious candidate, but so is 
maximizing total network throughput. Furthermore, these two goals are also in conflict, since 
operating any queueing system near capacity implies a long queueing delay. As a compromise, 
many networks attempt to minimize the number of hops a packet must make, because 
reducing the number of hops tends to improve the delay and also reduce the amount of 
bandwidth consumed, which tends to improve the throughput as well. 


Routing algorithms can be grouped into two major classes: nonadaptive and adaptive. 
Nonadaptive algorithms do not base their routing decisions on measurements or estimates 
of the current traffic and topology. Instead, the choice of the route to use to get from I to J 
(for all I and J) is computed in advance, off-line, and downloaded to the routers when the
network is booted. This procedure is sometimes called static routing. 


Adaptive algorithms, in contrast, change their routing decisions to reflect changes in the 
topology, and usually the traffic as well. Adaptive algorithms differ in where they get their 
information (e.g., locally, from adjacent routers, or from all routers), when they change the 
routes (e.g., every ΔT sec, when the load changes or when the topology changes), and what
metric is used for optimization (e.g., distance, number of hops, or estimated transit time). In 
the following sections we will discuss a variety of routing algorithms, both static and dynamic. 


5.2.1 The Optimality Principle 


Before we get into specific algorithms, it may be helpful to note that one can make a general 
statement about optimal routes without regard to network topology or traffic. This statement is 
known as the optimality principle. It states that if router J is on the optimal path from router 
I to router K, then the optimal path from J to K also falls along the same route. To see this, 
call the part of the route from I to J r1 and the rest of the route r2. If a route better than r2
existed from J to K, it could be concatenated with r1 to improve the route from I to K,
contradicting our statement that r1r2 is optimal.


As a direct consequence of the optimality principle, we can see that the set of optimal routes 
from all sources to a given destination form a tree rooted at the destination. Such a tree is 
called a sink tree and is illustrated in Fig. 5-6, where the distance metric is the number of 
hops. Note that a sink tree is not necessarily unique; other trees with the same path lengths 
may exist. The goal of all routing algorithms is to discover and use the sink trees for all 
routers. 


Figure 5-6. (a) A subnet. (b) A sink tree for router B.


Since a sink tree is indeed a tree, it does not contain any loops, so each packet will be 
delivered within a finite and bounded number of hops. In practice, life is not quite this easy. 
Links and routers can go down and come back up during operation, so different routers may 
have different ideas about the current topology. Also, we have quietly finessed the issue of 
whether each router has to individually acquire the information on which to base its sink tree 
computation or whether this information is collected by some other means. We will come back 
to these issues shortly. Nevertheless, the optimality principle and the sink tree provide a 
benchmark against which other routing algorithms can be measured. 
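
When the metric is the number of hops, a sink tree can be built with a breadth-first search that starts at the destination. The sketch below assumes an undirected subnet given as an adjacency matrix; the names build_sink_tree and MAX_ROUTERS are hypothetical and the code is illustrative rather than a production routing algorithm.

```c
#define MAX_ROUTERS 16

/* Build a sink tree for destination dst, using hop count as the metric.
   adj[i][j] is nonzero if routers i and j share a line (assumed symmetric).
   On return, next_hop[i] is the neighbor that router i forwards to in order
   to reach dst; it is -1 for dst itself and for unreachable routers. */
void build_sink_tree(int n, int adj[MAX_ROUTERS][MAX_ROUTERS], int dst,
                     int next_hop[MAX_ROUTERS])
{
    int queue[MAX_ROUTERS], visited[MAX_ROUTERS];
    int head = 0, tail = 0, i, u;

    for (i = 0; i < n; i++) {
        next_hop[i] = -1;
        visited[i] = 0;
    }
    visited[dst] = 1;
    queue[tail++] = dst;

    /* Breadth-first search outward from the destination. The first time a
       router is reached, the router it was reached from lies on a minimum-hop
       path to dst, so it becomes that router's next hop. */
    while (head < tail) {
        u = queue[head++];
        for (i = 0; i < n; i++)
            if (adj[u][i] && !visited[i]) {
                visited[i] = 1;
                next_hop[i] = u;
                queue[tail++] = i;
            }
    }
}
```

Because each router records exactly one next hop, the result has no loops. When several minimum-hop paths exist, the order in which neighbors are examined decides which of the equally good sink trees is produced.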


5.2.2 Shortest Path Routing 


Let us begin our study of feasible routing algorithms with a technique that is widely used in 
many forms because it is simple and easy to understand. The idea is to build a graph of the 
subnet, with each node of the graph representing a router and each arc of the graph 
representing a communication line (often called a link). To choose a route between a given 
pair of routers, the algorithm just finds the shortest path between them on the graph. 


The concept of a shortest path deserves some explanation. One way of measuring path 
length is the number of hops. Using this metric, the paths ABC and ABE in Fig. 5-7 are equally 
long. Another metric is the geographic distance in kilometers, in which case ABC is clearly 
much longer than ABE (assuming the figure is drawn to scale). 


Figure 5-7. The first five steps used in computing the shortest path 
from A to D. The arrows indicate the working node. 


However, many other metrics besides hops and physical distance are also possible. For
example, each arc could be labeled with the mean queueing and transmission delay for some 
standard test packet as determined by hourly test runs. With this graph labeling, the shortest 
path is the fastest path rather than the path with the fewest arcs or kilometers. 


In the general case, the labels on the arcs could be computed as a function of the distance, 
bandwidth, average traffic, communication cost, mean queue length, measured delay, and 
other factors. By changing the weighting function, the algorithm would then compute the 
"shortest" path measured according to any one of a number of criteria or to a combination of 
criteria. 
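
One simple way to picture such a weighting function is as a weighted sum of per-link measurements. The sketch below assumes a linear combination, which the text does not prescribe; the structure names, field names, and units are all hypothetical.

```c
/* Measurements for one communication line (hypothetical fields and units). */
struct link_metrics {
    double distance_km;     /* geographic length of the line */
    double delay_ms;        /* measured delay for a standard test packet */
    double queue_len;       /* mean queue length, in packets */
};

/* Relative importance of each measurement (hypothetical). */
struct weights {
    double distance;
    double delay;
    double queue;
};

/* Compute the label for one arc as a weighted sum of its measurements.
   A linear combination is only one of many possible weighting functions. */
double arc_label(const struct link_metrics *m, const struct weights *w)
{
    return w->distance * m->distance_km
         + w->delay * m->delay_ms
         + w->queue * m->queue_len;
}
```

Setting every weight except delay to zero makes the shortest path the fastest path; setting every weight except distance to zero recovers the geographic metric. The shortest path algorithm itself does not change, only the arc labels it is given.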


Several algorithms for computing the shortest path between two nodes of a graph are known. 
This one is due to Dijkstra (1959). Each node is labeled (in parentheses) with its distance from 
the source node along the best known path. Initially, no paths are known, so all nodes are 
labeled with infinity. As the algorithm proceeds and paths are found, the labels may change, 
reflecting better paths. A label may be either tentative or permanent. Initially, all labels are 
tentative. When it is discovered that a label represents the shortest possible path from the 
source to that node, it is made permanent and never changed thereafter. 


To illustrate how the labeling algorithm works, look at the weighted, undirected graph of Fig. 
5-7(a), where the weights represent, for example, distance. We want to find the shortest path 
from A to D. We start out by marking node A as permanent, indicated by a filled-in circle. Then 
we examine, in turn, each of the nodes adjacent to A (the working node), relabeling each one 
with the distance to A. Whenever a node is relabeled, we also label it with the node from which 
the probe was made so that we can reconstruct the final path later. Having examined each of 
the nodes adjacent to A, we examine all the tentatively labeled nodes in the whole graph and 
make the one with the smallest label permanent, as shown in Fig. 5-7(b). This one becomes 
the new working node. 


We now start at B and examine all nodes adjacent to it. If the sum of the label on B and the 
distance from B to the node being considered is less than the label on that node, we have a 
shorter path, so the node is relabeled. 


After all the nodes adjacent to the working node have been inspected and the tentative labels 
changed if possible, the entire graph is searched for the tentatively-labeled node with the 
smallest value. This node is made permanent and becomes the working node for the next 
round. Figure 5-7 shows the first five steps of the algorithm. 


To see why the algorithm works, look at Fig. 5-7(c). At that point we have just made E 
permanent. Suppose that there were a shorter path than ABE, say AXYZE. There are two 
possibilities: either node Z has already been made permanent, or it has not been. If it has, 
then E has already been probed (on the round following the one when Z was made 
permanent), so the AXYZE path has not escaped our attention and thus cannot be a shorter 
path. 


Now consider the case where Z is still tentatively labeled. Either the label at Z is greater than 
or equal to that at E, in which case AXYZE cannot be a shorter path than ABE, or it is less than 
that of E, in which case Z and not E will become permanent first, allowing E to be probed from 
Z. 


This algorithm is given in Fig. 5-8. The global variables n and dist describe the graph and are 
initialized before shortest_path is called. The only difference between the program and the
algorithm described above is that in Fig. 5-8, we compute the shortest path starting at the 
terminal node, t, rather than at the source node, s. Since the shortest path from t to s in an 
undirected graph is the same as the shortest path from s to t, it does not matter at which end 
we begin (unless there are several shortest paths, in which case reversing the search might 
discover a different one). The reason for searching backward is that each node is labeled with 
its predecessor rather than its successor. When the final path is copied into the output 
variable, path, the path is thus reversed. By reversing the search, the two effects cancel, and
the answer is produced in the correct order. 


Figure 5-8. Dijkstra's algorithm to compute the shortest path through a graph.


#define MAX_NODES 1024                      /* maximum number of nodes */
#define INFINITY 1000000000                 /* a number larger than every maximum path */

int n, dist[MAX_NODES][MAX_NODES];          /* dist[i][j] is the distance from i to j */

void shortest_path(int s, int t, int path[])
{
    struct state {                          /* the path being worked on */
        int predecessor;                    /* previous node */
        int length;                         /* length from source to this node */
        enum {permanent, tentative} label;  /* label state */
    } state[MAX_NODES];

    int i, k, min;
    struct state *p;

    for (p = &state[0]; p < &state[n]; p++) {   /* initialize state */
        p->predecessor = -1;
        p->length = INFINITY;
        p->label = tentative;
    }

    state[t].length = 0; state[t].label = permanent;

    k = t;                                  /* k is the initial working node */
    do {                                    /* Is there a better path from k? */
        for (i = 0; i < n; i++)             /* this graph has n nodes */
            if (dist[k][i] != 0 && state[i].label == tentative) {
                if (state[k].length + dist[k][i] < state[i].length) {
                    state[i].predecessor = k;
                    state[i].length = state[k].length + dist[k][i];
                }
            }

        /* Find the tentatively labeled node with the smallest label. */
        k = 0; min = INFINITY;
        for (i = 0; i < n; i++)
            if (state[i].label == tentative && state[i].length < min) {
                min = state[i].length;
                k = i;
            }

        state[k].label = permanent;
    } while (k != s);

    /* Copy the path into the output array. */
    i = 0; k = s;
    do { path[i++] = k; k = state[k].predecessor; } while (k >= 0);
}


5.2.3 Flooding 


Another static algorithm is flooding, in which every incoming packet is sent out on every 
outgoing line except the one it arrived on. Flooding obviously generates vast numbers of 
duplicate packets, in fact, an infinite number unless some measures are taken to damp the 
process. One such measure is to have a hop counter contained in the header of each packet, 
which is decremented at each hop, with the packet being discarded when the counter reaches 
zero. Ideally, the hop counter should be initialized to the length of the path from source to 
destination. If the sender does not know how long the path is, it can initialize the counter to 
the worst case, namely, the full diameter of the subnet. 
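
The damping effect of the hop counter can be seen by counting how many packet copies one flooded packet generates. The sketch below is a toy model under stated assumptions: the subnet is an adjacency matrix, the names flood and MAX_ROUTERS are hypothetical, and a packet whose counter has reached zero is discarded rather than forwarded.

```c
#define MAX_ROUTERS 16

/* Flood one packet that is at router `at`, having arrived over the line
   from router `from` (-1 if it came from a host), with `hops` hops left.
   adj[i][j] is nonzero if routers i and j share a line.
   Returns the number of packet copies transmitted as a result. */
int flood(int n, int adj[MAX_ROUTERS][MAX_ROUTERS], int at, int from, int hops)
{
    int i, sent = 0;

    if (hops == 0)              /* counter exhausted: discard the packet */
        return 0;

    /* Send a copy on every line except the one the packet arrived on;
       each copy carries a counter that is one smaller. */
    for (i = 0; i < n; i++)
        if (adj[at][i] && i != from)
            sent += 1 + flood(n, adj, i, at, hops - 1);
    return sent;
}
```

On a triangle of three routers, a counter of 1 produces 2 transmissions and a counter of 2 produces 4. Without the counter the recursion would never end, which is the infinite duplication the text warns about.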


An alternative technique for damming the flood is to keep track of which packets have been 
flooded, to avoid sending them out a second time. One way to achieve this goal is to have the source
router put a sequence number in each packet it receives from its hosts. Each router then 
needs a list per source router telling which sequence numbers originating at that source have
already been seen. If an incoming packet is on the list, it is not flooded. 


To prevent the list from growing without bound, each list should be augmented by a counter, 
k, meaning that all sequence numbers through k have been seen. When a packet comes in, it 
is easy to check if the packet is a duplicate; if so, it is discarded. Furthermore, the full list 
below k is not needed, since k effectively summarizes it. 
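
The bookkeeping for one source router might look like the following sketch. It assumes that sequence numbers start at 1 and that k starts at 0; the names seen_list, check_and_record, and MAX_PENDING are hypothetical. If the explicit list is full, a new number is simply not recorded, a simplification a real router would have to handle more carefully.

```c
#define MAX_PENDING 64

/* Per-source record of the sequence numbers already seen. Every number up
   to and including k has been seen; numbers above k that have been seen
   are kept in the explicit list. */
struct seen_list {
    unsigned long k;
    unsigned long pending[MAX_PENDING];
    int npending;
};

/* Return 1 if seq is a duplicate (do not flood it). Otherwise record it,
   advance k over any numbers that are now contiguous, and return 0. */
int check_and_record(struct seen_list *s, unsigned long seq)
{
    int i, advanced;

    if (seq <= s->k)                    /* summarized by the counter */
        return 1;
    for (i = 0; i < s->npending; i++)   /* seen, but above the counter */
        if (s->pending[i] == seq)
            return 1;

    if (s->npending < MAX_PENDING)
        s->pending[s->npending++] = seq;

    /* Fold entries equal to k + 1 into the counter until none remain. */
    do {
        advanced = 0;
        for (i = 0; i < s->npending; i++)
            if (s->pending[i] == s->k + 1) {
                s->k++;
                s->pending[i] = s->pending[--s->npending];
                advanced = 1;
                break;
            }
    } while (advanced);
    return 0;
}
```

If packets 1, 3, and 2 arrive in that order, k becomes 1, stays at 1 while 3 waits in the list, and then jumps to 3 with the list empty again, so the list holds only numbers that arrived ahead of a gap.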


A variation of flooding that is slightly more practical is selective flooding. In this algorithm
the routers do not send every incoming packet out on every line, only on those lines that are 
going approximately in the right direction. There is usually little point in sending a westbound 
packet on an eastbound line unless the topology is extremely peculiar and the router is sure of 
this fact. 


Flooding is not practical in most applications, but it does have some uses. For example, in 
military applications, where large numbers of routers may be blown to bits at any instant, the 
tremendous robustness of flooding is highly desirable. In distributed database applications, it is 
sometimes necessary to update all the databases concurrently, in which case flooding can be 
useful. In wireless networks, all messages transmitted by a station can be received by all other 
stations within its radio range, which is, in fact, flooding, and some algorithms utilize this 
property. A fourth possible use of flooding is as a metric against which other routing 
algorithms can be compared. Flooding always chooses the shortest path because it chooses 
every possible path in parallel. Consequently, no other algorithm can produce a shorter delay 
(if we ignore the overhead generated by the flooding process itself). 


5.2.4 Distance Vector Routing 


Modern computer networks generally use dynamic routing algorithms rather than the static 
ones described above because static algorithms do not take the current network load into 
account. Two dynamic algorithms in particular, distance vector routing and link state routing, 
are the most popular. In this section we will look at the former algorithm. In the following 
section we will study the latter algorithm. 


Distance vector routing algorithms operate by having each router maintain a table (i.e., a 
vector) giving the best known distance to each destination and which line to use to get there. 
These tables are updated by exchanging information with the neighbors. 


The distance vector routing algorithm is sometimes called by other names, most commonly the 
distributed Bellman-Ford routing algorithm and the Ford-Fulkerson algorithm, after the 
researchers who developed it (Bellman, 1957; and Ford and Fulkerson, 1962). It was the 
original ARPANET routing algorithm and was also used in the Internet under the name RIP. 


In distance vector routing, each router maintains a routing table indexed by, and containing 
one entry for, each router in the subnet. This entry contains two parts: the preferred outgoing 
line to use for that destination and an estimate of the time or distance to that destination. The 
metric used might be number of hops, time delay in milliseconds, total number of packets 
queued along the path, or something similar. 


The router is assumed to know the "distance" to each of its neighbors. If the metric is hops, 
the distance is just one hop. If the metric is queue length, the router simply examines each 
queue. If the metric is delay, the router can measure it directly with special ECHO packets that 
the receiver just timestamps and sends back as fast as it can. 


As an example, assume that delay is used as a metric and that the router knows the delay to 
each of its neighbors. Once every T msec each router sends to each neighbor a list of its 
estimated delays to each destination. It also receives a similar list from each neighbor. 
Imagine that one of these tables has just come in from neighbor X, with X_i being X's estimate 
of how long it takes to get to router i. If the router knows that the delay to X is m msec, it also 
knows that it can reach router i via X in X_i + m msec. By performing this calculation for each 
neighbor, a router can find out which estimate seems the best and use that estimate and the 
corresponding line in its new routing table. Note that the old routing table is not used in the 
calculation. 


This updating process is illustrated in Fig. 5-9. Part (a) shows a subnet. The first four columns 
of part (b) show the delay vectors received from the neighbors of router J. A claims to have a 
12-msec delay to B, a 25-msec delay to C, a 40-msec delay to D, etc. Suppose that J has 
measured or estimated its delay to its neighbors, A, I, H, and K as 8, 10, 12, and 6 msec, 
respectively. 


Figure 5-9. (a) A subnet. (b) Input from A, I, H, K, and the new routing table for J. 

Consider how J computes its new route to router G. It knows that it can get to A in 8 msec, 
and A claims to be able to get to G in 18 msec, so J knows it can count on a delay of 26 msec 
to G if it forwards packets bound for G to A. Similarly, it computes the delay to G via I, H, and 
K as 41 (31 + 10), 18 (6 + 12), and 37 (31 + 6) msec, respectively. The best of these values 
is 18, so it makes an entry in its routing table that the delay to G is 18 msec and that the 
route to use is via H. The same calculation is performed for all the other destinations, with the 
new routing table shown in the last column of the figure. 
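
The calculation J performs for one destination can be written as a short C function. This is a minimal sketch, not code from any real router: the function name best_neighbor and the array layout (one slot per neighbor, in the order A, I, H, K) are assumptions made for illustration.

```c
#include <stddef.h>

/* Given the measured delay to each neighbor and each neighbor's claimed
   delay to one destination, return the index of the neighbor giving the
   smallest total and store that total in *best_delay.
   Assumes n_neighbors >= 1. */
size_t best_neighbor(const int link_delay[], const int claimed[],
                     size_t n_neighbors, int *best_delay)
{
    size_t best = 0;
    *best_delay = link_delay[0] + claimed[0];
    for (size_t i = 1; i < n_neighbors; i++) {
        int d = link_delay[i] + claimed[i];
        if (d < *best_delay) {
            *best_delay = d;
            best = i;
        }
    }
    return best;
}
```

With link delays {8, 10, 12, 6} and claimed delays to G of {18, 31, 6, 31}, the totals are 26, 41, 18, and 37, so the function selects index 2 (H) with a delay of 18 msec, matching the text. Running it once per destination yields the whole new table.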


The Count-to-Infinity Problem 


Distance vector routing works in theory but has a serious drawback in practice: although it 
converges to the correct answer, it may do so slowly. In particular, it reacts rapidly to good 
news, but leisurely to bad news. Consider a router whose best route to destination X is large. 
If on the next exchange neighbor A suddenly reports a short delay to X, the router just 
switches over to using the line to A to send traffic to X. In one vector exchange, the good news 
is processed. 


To see how fast good news propagates, consider the five-node (linear) subnet of Fig. 5-10, 
where the delay metric is the number of hops. Suppose A is down initially and all the other 
routers know this. In other words, they have all recorded the delay to A as infinity. 

Figure 5-10. The count-to-infinity problem. 

When A comes up, the other routers learn about it via the vector exchanges. For simplicity we 
will assume that there is a gigantic gong somewhere that is struck periodically to initiate a 
vector exchange at all routers simultaneously. At the time of the first exchange, B learns that 
its left neighbor has zero delay to A. B now makes an entry in its routing table that A is one 
hop away to the left. All the other routers still think that A is down. At this point, the routing 
table entries for A are as shown in the second row of Fig. 5-10(a). On the next exchange, C 
learns that B has a path of length 1 to A, so it updates its routing table to indicate a path of 
length 2, but D and E do not hear the good news until later. Clearly, the good news is 
spreading at the rate of one hop per exchange. In a subnet whose longest path is of length N 
hops, within N exchanges everyone will know about newly-revived lines and routers. 


Now let us consider the situation of Fig. 5-10(b), in which all the lines and routers are initially 
up. Routers B, C, D, and E have distances to A of 1, 2, 3, and 4, respectively. Suddenly A goes 
down, or alternatively, the line between A and B is cut, which is effectively the same thing 
from B's point of view. 


At the first packet exchange, B does not hear anything from A. Fortunately, C says: Do not 
worry; I have a path to A of length 2. Little does B know that C's path runs through B itself. 
For all B knows, C might have ten lines all with separate paths to A of length 2. As a result, B 
thinks it can reach A via C, with a path length of 3. D and E do not update their entries for A 
on the first exchange. 


On the second exchange, C notices that each of its neighbors claims to have a path to A of 
length 3. It picks one of them at random and makes its new distance to A 4, as shown in 
the third row of Fig. 5-10(b). Subsequent exchanges produce the history shown in the rest of 
Fig. 5-10(b). 


From this figure, it should be clear why bad news travels slowly: no router ever has a value 
more than one higher than the minimum of all its neighbors. Gradually, all routers work their 
way up to infinity, but the number of exchanges required depends on the numerical value used 
for infinity. For this reason, it is wise to set infinity to the longest path plus 1. If the metric is 
time delay, there is no well-defined upper bound, so a high value is needed to prevent a path 
with a long delay from being treated as down. Not entirely surprisingly, this problem is known 
as the count-to-infinity problem. There have been a few attempts to solve it (such as split 
horizon with poisoned reverse in RFC 1058), but none of these work well in general. The core 
of the problem is that when X tells Y that it has a path somewhere, Y has no way of knowing 
whether it itself is on the path. 
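
The slow upward creep is easy to reproduce with a small simulation of the linear subnet after A has gone down. The sketch below is illustrative only: the name exchange, the synchronous update (the "gong" of the text), and the choice of 16 as the value of infinity are assumptions.

```c
#define ROUTERS 4          /* B, C, D, E; A is down */
#define INFINITY_HOPS 16   /* assumed value standing in for infinity */

/* One synchronous vector exchange on the line B - C - D - E. Each
   router sets its distance to A to one more than the smallest value
   advertised by its neighbors, capped at INFINITY_HOPS. */
void exchange(int dist[ROUTERS])
{
    int next[ROUTERS];
    for (int i = 0; i < ROUTERS; i++) {
        int best = INFINITY_HOPS;
        if (i > 0 && dist[i - 1] < best)
            best = dist[i - 1];
        if (i < ROUTERS - 1 && dist[i + 1] < best)
            best = dist[i + 1];
        next[i] = (best + 1 < INFINITY_HOPS) ? best + 1 : INFINITY_HOPS;
    }
    for (int i = 0; i < ROUTERS; i++)
        dist[i] = next[i];
}
```

Starting from distances 1, 2, 3, 4, successive calls give 3, 2, 3, 4, then 3, 4, 3, 4, then 5, 4, 5, 4, and so on. Every router reaches the assumed infinity only after about as many exchanges as the value chosen for infinity.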


5.2.5 Link State Routing 


Distance vector routing was used in the ARPANET until 1979, when it was replaced by link 
state routing. Two primary problems caused its demise. First, since the delay metric was 
queue length, it did not take line bandwidth into account when choosing routes. Initially, all 
the lines were 56 kbps, so line bandwidth was not an issue, but after some lines had been 
upgraded to 230 kbps and others to 1.544 Mbps, not taking bandwidth into account was a 
major problem. Of course, it would have been possible to change the delay metric to factor in 
line bandwidth, but a second problem also existed, namely, the algorithm often took too long 
to converge (the count-to-infinity problem). For these reasons, it was replaced by an entirely 
new algorithm, now called link state routing. Variants of link state routing are now widely 
used. 


The idea behind link state routing is simple and can be stated as five parts. Each router must 
do the following: 


1. Discover its neighbors and learn their network addresses. 
2. Measure the delay or cost to each of its neighbors. 
3. Construct a packet telling all it has just learned. 
4. Send this packet to all other routers. 
5. Compute the shortest path to every other router. 


In effect, the complete topology and all delays are experimentally measured and distributed to 
every router. Then Dijkstra's algorithm can be run to find the shortest path to every other 
router. Below we will consider each of these five steps in more detail. 


Learning about the Neighbors 


When a router is booted, its first task is to learn who its neighbors are. It accomplishes this 
goal by sending a special HELLO packet on each point-to-point line. The router on the other 
end is expected to send back a reply telling who it is. These names must be globally unique 
because when a distant router later hears that three routers are all connected to F, it is 
essential that it can determine whether all three mean the same F. 


When two or more routers are connected by a LAN, the situation is slightly more complicated. 
Fig. 5-11(a) illustrates a LAN to which three routers, A, C, and F, are directly connected. Each 
of these routers is connected to one or more additional routers, as shown. 


Figure 5-11. (a) Nine routers and a LAN. (b) A graph model of (a). 

One way to model the LAN is to consider it as a node itself, as shown in Fig. 5-11(b). Here we 
have introduced a new, artificial node, N, to which A, C, and F are connected. The fact that it is 
possible to go from A to C on the LAN is represented by the path ANC here. 


Measuring Line Cost 


The link state routing algorithm requires each router to know, or at least have a reasonable 
estimate of, the delay to each of its neighbors. The most direct way to determine this delay is 
to send over the line a special ECHO packet that the other side is required to send back 
immediately. By measuring the round-trip time and dividing it by two, the sending router can 
get a reasonable estimate of the delay. For even better results, the test can be conducted 
several times, and the average used. Of course, this method implicitly assumes the delays are 
symmetric, which may not always be the case. 
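
As a small illustration, the estimate can be computed as follows. The function name estimate_delay is hypothetical, the measurements are assumed to be in milliseconds, and at least one sample is assumed to be supplied.

```c
/* Estimate the one-way delay to a neighbor from several round-trip
   measurements (in msec), assuming the line is symmetric.
   Requires samples >= 1. */
double estimate_delay(const double rtt_msec[], int samples)
{
    double sum = 0.0;
    for (int i = 0; i < samples; i++)
        sum += rtt_msec[i];
    return (sum / samples) / 2.0;   /* average round trip, halved */
}
```

For round-trip times of 10, 12, and 14 msec, the average is 12 msec and the estimated one-way delay is 6 msec.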


An interesting issue is whether to take the load into account when measuring the delay. To 
factor the load in, the round-trip timer must be started when the ECHO packet is queued. To 
ignore the load, the timer should be started when the ECHO packet reaches the front of the 
queue. 


Arguments can be made both ways. Including traffic-induced delays in the measurements 
means that when a router has a choice between two lines with the same bandwidth, one of 
which is heavily loaded all the time and one of which is not, the router will regard the route 
over the unloaded line as a shorter path. This choice will result in better performance. 


Unfortunately, there is also an argument against including the load in the delay calculation. 
Consider the subnet of Fig. 5-12, which is divided into two parts, East and West, connected by 
two lines, CF and EI. 


Figure 5-12. A subnet in which the East and West parts are connected by two lines. 

Suppose that most of the traffic between East and West is using line CF, and as a result, this 
line is heavily loaded with long delays. Including queueing delay in the shortest path 
calculation will make EI more attractive. After the new routing tables have been installed, most 
of the East-West traffic will now go over EI, overloading this line. Consequently, in the next 
update, CF will appear to be the shortest path. As a result, the routing tables may oscillate 
wildly, leading to erratic routing and many potential problems. If load is ignored and only 
bandwidth is considered, this problem does not occur. Alternatively, the load can be spread 
over both lines, but this solution does not fully utilize the best path. Nevertheless, to avoid 
oscillations in the choice of best path, it may be wise to distribute the load over multiple lines, 
with some known fraction going over each line. 


Building Link State Packets 


Once the information needed for the exchange has been collected, the next step is for each 
router to build a packet containing all the data. The packet starts with the identity of the 
sender, followed by a sequence number and age (to be described later), and a list of 
neighbors. For each neighbor, the delay to that neighbor is given. An example subnet is given 
in Fig. 5-13(a) with delays shown as labels on the lines. The corresponding link state packets 
for all six routers are shown in Fig. 5-13(b). 

Figure 5-13. (a) A subnet. (b) The link state packets for this subnet. 

Building the link state packets is easy. The hard part is determining when to build them. One 
possibility is to build them periodically, that is, at regular intervals. Another possibility is to 
build them when some significant event occurs, such as a line or neighbor going down or 
coming back up again or changing its properties appreciably. 


Distributing the Link State Packets 


The trickiest part of the algorithm is distributing the link state packets reliably. As the packets 
are distributed and installed, the routers getting the first ones will change their routes. 
Consequently, the different routers may be using different versions of the topology, which can 
lead to inconsistencies, loops, unreachable machines, and other problems. 


First we will describe the basic distribution algorithm. Later we will give some refinements. The 
fundamental idea is to use flooding to distribute the link state packets. To keep the flood in 
check, each packet contains a sequence number that is incremented for each new packet sent. 
Routers keep track of all the (source router, sequence) pairs they see. When a new link state 
packet comes in, it is checked against the list of packets already seen. If it is new, it is 
forwarded on all lines except the one it arrived on. If it is a duplicate, it is discarded. If a 
packet with a sequence number lower than the highest one seen so far ever arrives, it is 
rejected as being obsolete since the router has more recent data. 
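
The accept, discard, or reject decision can be sketched in C. This is a simplified illustration rather than any protocol's actual code: it keeps only the highest sequence number seen per source instead of the full list of pairs, and the names lsp_action and classify_lsp, as well as the convention that 0 means "nothing seen yet", are assumptions.

```c
#include <stdint.h>

enum lsp_action { LSP_FLOOD, LSP_DUPLICATE, LSP_OBSOLETE };

/* highest_seen[src] holds the highest sequence number accepted so far
   from source router src (0 is assumed to mean nothing seen yet). */
enum lsp_action classify_lsp(uint32_t highest_seen[], int src, uint32_t seq)
{
    if (seq > highest_seen[src]) {
        highest_seen[src] = seq;   /* new: forward on all other lines */
        return LSP_FLOOD;
    }
    if (seq == highest_seen[src])
        return LSP_DUPLICATE;      /* already seen: discard */
    return LSP_OBSOLETE;           /* older than current data: reject */
}
```

The same comparison also exposes the weaknesses discussed next: a router that restarts at 0, or a corrupted number that is far too high, makes perfectly good packets look obsolete.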


This algorithm has a few problems, but they are manageable. First, if the sequence numbers 
wrap around, confusion will reign. The solution here is to use a 32-bit sequence number. With 
one link state packet per second, it would take 137 years to wrap around, so this possibility 
can be ignored. 


Second, if a router ever crashes, it will lose track of its sequence number. If it starts again at 
0, the next packet will be rejected as a duplicate. 


Third, if a sequence number is ever corrupted and 65,540 is received instead of 4 (a 1-bit 
error), packets 5 through 65,540 will be rejected as obsolete, since the current sequence 
number is thought to be 65,540. 
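
The arithmetic behind this example is a single flipped bit: 65,540 is 4 + 65,536, that is, 4 with bit 16 inverted. The helper below (flip_bit is a hypothetical name) just makes that explicit.

```c
#include <stdint.h>

/* Invert one bit of a sequence number, modeling a 1-bit error.
   Requires bit < 32. */
uint32_t flip_bit(uint32_t seq, unsigned bit)
{
    return seq ^ (UINT32_C(1) << bit);
}
```

Calling flip_bit(4, 16) yields 65540.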


The solution to all these problems is to include the age of each packet after the sequence 
number and decrement it once per second. When the age hits zero, the information from that 
router is discarded. Normally, a new packet comes in, say, every 10 sec, so router information 
only times out when a router is down (or six consecutive packets have been lost, an unlikely 
event). The Age field is also decremented by each router during the initial flooding process, to 
make sure no packet can get lost and live for an indefinite period of time (a packet whose age 
is zero is discarded). 
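
A minimal sketch of the aging mechanism follows. The structure and function names are hypothetical, and the starting age of 60 sec is an assumption chosen to agree with the figures above (a refresh every 10 sec, with a timeout after six missed packets).

```c
#include <stdbool.h>

#define MAX_AGE 60   /* assumed: six 10-sec refresh intervals */

struct lsp_entry {
    int age;      /* seconds of life left */
    bool valid;   /* false once the information has been discarded */
};

/* Called when a fresh link state packet from the router arrives. */
void refresh(struct lsp_entry *e)
{
    e->age = MAX_AGE;
    e->valid = true;
}

/* Called once per second for each stored entry. */
void tick(struct lsp_entry *e)
{
    if (e->valid && --e->age <= 0)
        e->valid = false;   /* age hit zero: discard the information */
}
```

An entry that is refreshed every 10 sec never expires. One that receives no refresh is discarded on the 60th tick, after which a restarted router's packet with a low sequence number can be accepted again.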


Some refinements to this algorithm make it more robust. When a link state packet comes in to 
a router for flooding, it is not queued for transmission immediately. Instead it is first put in a 
holding area to wait a short while. If another link state packet from the same source comes in 
before the first packet is transmitted, their sequence numbers are compared. If they are equal, 
the duplicate is discarded. If they are different, the older one is thrown out. To guard against 
errors on the router-router lines, all link state packets are acknowledged. When a line goes 
idle, the holding area is scanned in round-robin order to select a packet or acknowledgement 
to send. 


The data structure used by router B for the subnet shown in Fig. 5-13(a) is depicted in 
Fig. 5-14. Each row here corresponds to a recently-arrived, but as yet not fully-processed, link state 
packet. The table records where the packet originated, its sequence number and age, and the 
data. In addition, there are send and acknowledgement flags for each of B's three lines (to A, 
C, and F, respectively). The send flags mean that the packet must be sent on the indicated 
line. The acknowledgement flags mean that it must be acknowledged there. 


Figure 5-14. The packet buffer for router B in Fig. 5-13. 

In Fig. 5-14, the link state packet from A arrives directly, so it must be sent to C and F and 
acknowledged to A, as indicated by the flag bits. Similarly, the packet from F has to be 
forwarded to A and C and acknowledged to F. 


However, the situation with the third packet, from E, is different. It arrived twice, once via EAB 
and once via EFB. Consequently, it has to be sent only to C but acknowledged to both A and F, 
as indicated by the bits. 


If a duplicate arrives while the original is still in the buffer, bits have to be changed. For 
example, if a copy of C's state arrives from F before the fourth entry in the table has been 
forwarded, the six bits will be changed to 100011 to indicate that the packet must be 
acknowledged to F but not sent there. 
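
The flag manipulation can be sketched with two three-bit fields, one for the send flags and one for the acknowledgement flags, each ordered A, C, F from left to right as in the text. The structure and function names below are hypothetical.

```c
#include <stdint.h>

#define LINES 3   /* B's lines to A, C, and F, numbered 0, 1, 2 */

struct buffer_entry {
    uint8_t send;   /* leftmost bit is line A: packet must be sent there */
    uint8_t ack;    /* same layout: packet must be acknowledged there */
};

/* Bit mask for a line, with line 0 (A) as the leftmost of the three bits. */
static uint8_t line_bit(int line)
{
    return (uint8_t)(1u << (LINES - 1 - line));
}

/* First copy arrives on `line`: send on every other line, ack on this one. */
void first_arrival(struct buffer_entry *e, int line)
{
    e->send = (uint8_t)(((1u << LINES) - 1u) & ~(unsigned)line_bit(line));
    e->ack = line_bit(line);
}

/* A duplicate arrives on `line` before forwarding is done: the packet
   need not be sent there any more, but it must be acknowledged there. */
void duplicate_arrival(struct buffer_entry *e, int line)
{
    e->send = (uint8_t)(e->send & ~(unsigned)line_bit(line));
    e->ack = (uint8_t)(e->ack | line_bit(line));
}
```

For C's packet, first_arrival on line 1 (C) gives send = 101 and ack = 010. A duplicate from line 2 (F) then changes these to send = 100 and ack = 011, the six bits 100011 mentioned above.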


Computing the New Routes 


Once a router has accumulated a full set of link state packets, it can construct the entire 
subnet graph because every link is represented. Every link is, in fact, represented twice, once 
for each direction. The two values can be averaged or used separately. 


Now Dijkstra's algorithm can be run locally to construct the shortest path to all possible 
destinations. The results of this algorithm can be installed in the routing tables, and normal 
operation resumed. 


For a subnet with n routers, each of which has k neighbors, the memory required to store the 
input data is proportional to kn. For large subnets, this can be a problem. Also, the 
computation time can be an issue. Nevertheless, in many practical situations, link state routing 
works well. 


However, problems with the hardware or software can wreak havoc with this algorithm (also 
with other ones). For example, if a router claims to have a line it does not have or forgets a 
line it does have, the subnet graph will be incorrect. If a router fails to forward packets or 
corrupts them while forwarding them, trouble will arise. Finally, if it runs out of memory or 
does the routing calculation wrong, bad things will happen. As the subnet grows into the range 
of tens or hundreds of thousands of nodes, the probability of some router failing occasionally 
becomes nonnegligible. The trick is to try to arrange to limit the damage when the inevitable 
happens. Perlman (1988) discusses these problems and their solutions in detail. 


Link state routing is widely used in actual networks, so a few words about some example 
protocols using it are in order. The OSPF protocol, which is widely used in the Internet, uses a 
link state algorithm. We will describe OSPF in Sec. 5.6.4. 


Another link state protocol is IS-IS (Intermediate System-Intermediate System), which 
was designed for DECnet and later adopted by ISO for use with its connectionless network 
layer protocol, CLNP. Since then it has been modified to handle other protocols as well, most 
notably, IP. IS-IS is used in some Internet backbones (including the old NSFNET backbone) 
and in some digital cellular systems such as CDPD. Novell NetWare uses a minor variant of 
IS-IS (NLSP) for routing IPX packets. 


Basically IS-IS distributes a picture of the router topology, from which the shortest paths are 
computed. Each router announces, in its link state information, which network layer addresses 
it can reach directly. These addresses can be IP, IPX, AppleTalk, or any other addresses. IS-IS 
can even support multiple network layer protocols at the same time. 


Many of the innovations designed for IS-IS were adopted by OSPF (OSPF was designed several 
years after IS-IS). These include a self-stabilizing method of flooding link state updates, the 
concept of a designated router on a LAN, and the method of computing and supporting path 
splitting and multiple metrics. As a consequence, there is very little difference between IS-IS 
and OSPF. The most important difference is that IS-IS is encoded in such a way that it is easy 
and natural to simultaneously carry information about multiple network layer protocols, a 
feature OSPF does not have. This advantage is especially valuable in large multiprotocol 
environments. 


5.2.6 Hierarchical Routing 


As networks grow in size, the router routing tables grow proportionally. Not only is router 
memory consumed by ever-increasing tables, but more CPU time is needed to scan them and 
more bandwidth is needed to send status reports about them. At a certain point the network 
may grow to the point where it is no longer feasible for every router to have an entry for every 
other router, so the routing will have to be done hierarchically, as it is in the telephone 
network. 


When hierarchical routing is used, the routers are divided into what we will call regions, with 
each router knowing all the details about how to route packets to destinations within its own 
region, but knowing nothing about the internal structure of other regions. When different 
networks are interconnected, it is natural to regard each one as a separate region in order to 
free the routers in one network from having to know the topological structure of the other 
ones. 


For huge networks, a two-level hierarchy may be insufficient; it may be necessary to group the 
regions into clusters, the clusters into zones, the zones into groups, and so on, until we run 
out of names for aggregations. As an example of a multilevel hierarchy, consider how a packet 
might be routed from Berkeley, California, to Malindi, Kenya. The Berkeley router would know 
the detailed topology within California but would send all out-of-state traffic to the Los Angeles 
router. The Los Angeles router would be able to route traffic to other domestic routers but 
would send foreign traffic to New York. The New York router would be programmed to direct all 
traffic to the router in the destination country responsible for handling foreign traffic, say, in 
Nairobi. Finally, the packet would work its way down the tree in Kenya until it got to Malindi. 


Figure 5-15 gives a quantitative example of routing in a two-level hierarchy with five regions. 
The full routing table for router 1A has 17 entries, as shown in Fig. 5-15(b). When routing is 
done hierarchically, as in Fig. 5-15(c), there are entries for all the local routers as before, but 
all other regions have been condensed into a single router, so all traffic for region 2 goes via 
the 1B-2A line, but the rest of the remote traffic goes via the 1C-3B line. Hierarchical routing 
has reduced the table from 17 to 7 entries. As the ratio of the number of regions to the 
number of routers per region grows, the savings in table space increase. 


Figure 5-15. Hierarchical routing. 

Unfortunately, these gains in space are not free. There is a penalty to be paid, and this penalty 
is in the form of increased path length. For example, the best route from 1A to 5C is via region 
2, but with hierarchical routing all traffic to region 5 goes via region 3, because that is better 
for most destinations in region 5. 


When a single network becomes very large, an interesting question is: How many levels should 
the hierarchy have? For example, consider a subnet with 720 routers. If there is no hierarchy, 
each router needs 720 routing table entries. If the subnet is partitioned into 24 regions of 30 
routers each, each router needs 30 local entries plus 23 remote entries for a total of 53 
entries. If a three-level hierarchy is chosen, with eight clusters, each containing 9 regions of 
10 routers, each router needs 10 entries for local routers, 8 entries for routing to other regions 
within its own cluster, and 7 entries for distant clusters, for a total of 25 entries. Kamoun and 
Kleinrock (1979) discovered that the optimal number of levels for an N router subnet is ln N, 
requiring a total of e ln N entries per router. They have also shown that the increase in 
effective mean path length caused by hierarchical routing is sufficiently small that it is usually 
acceptable. 
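
The table sizes in the 720-router example follow one rule: at every level above the lowest, a router needs one entry for each sibling group, and at the lowest level it needs one entry for each local router. A sketch of that rule in C, with the hypothetical name table_entries and the fanouts listed from the top level down:

```c
/* Routing table entries per router for a hierarchy whose fanout at
   each level is given from the top level down. Each level above the
   lowest contributes (fanout - 1) entries for the sibling groups; the
   lowest level contributes one entry per local router.
   Requires levels >= 1. */
int table_entries(const int fanout[], int levels)
{
    int total = 0;
    for (int i = 0; i < levels - 1; i++)
        total += fanout[i] - 1;
    return total + fanout[levels - 1];
}
```

The fanouts {720}, {24, 30}, and {8, 9, 10} give 720, 53, and 25 entries, respectively, matching the figures in the text.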


5.2.7 Broadcast Routing 


In some applications, hosts need to send messages to many or all other hosts. For example, a 
service distributing weather reports, stock market updates, or live radio programs might work 
best by broadcasting to all machines and letting those that are interested read the data. 
Sending a packet to all destinations simultaneously is called broadcasting; various methods 
have been proposed for doing it. 

One broadcasting method that requires no special features from the subnet is for the source to 
simply send a distinct packet to each destination. Not only is the method wasteful of 
bandwidth, but it also requires the source to have a complete list of all destinations. In practice 
this may be the only possibility, but it is the least desirable of the methods. 

Flooding is another obvious candidate. Although flooding is ill-suited for ordinary point-to-point 
communication, for broadcasting it might rate serious consideration, especially if none of the 
methods described below are applicable. The problem with flooding as a broadcast technique is 
the same problem it has as a point-to-point routing algorithm: it generates too many packets 
and consumes too much bandwidth. 


A third algorithm is multidestination routing. If this method is used, each packet contains 
either a list of destinations or a bit map indicating the desired destinations. When a packet 
arrives at a router, the router checks all the destinations to determine the set of output lines 
that will be needed. (An output line is needed if it is the best route to at least one of the 
destinations.) The router generates a new copy of the packet for each output line to be used 
and includes in each packet only those destinations that are to use the line. In effect, the 
destination set is partitioned among the output lines. After a sufficient number of hops, each 
packet will carry only one destination and can be treated as a normal packet. Multidestination 
routing is like separately addressed packets, except that when several packets must follow the 
same route, one of them pays full fare and the rest ride free. 
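
The partitioning step at each router can be sketched for the bit map form of the destination set. Everything named here is an assumption for illustration: the function partition_destinations, the table next_line giving the best output line per destination, a limit of 32 destinations, and a limit of 8 output lines.

```c
#include <stdint.h>

#define MAX_LINES 8   /* assumed number of output lines on the router */

/* dests is a bit map of the packet's destinations (bit d set means
   destination d is wanted; at most 32 destinations). next_line[d] is
   the output line on the best route to destination d and must be less
   than MAX_LINES. On return, out[l] is the bit map to place in the copy
   sent on line l; a zero entry means no copy is sent on that line. */
void partition_destinations(uint32_t dests, const int next_line[],
                            int n_dests, uint32_t out[MAX_LINES])
{
    for (int l = 0; l < MAX_LINES; l++)
        out[l] = 0;
    for (int d = 0; d < n_dests; d++)
        if (dests & (UINT32_C(1) << d))
            out[next_line[d]] |= UINT32_C(1) << d;
}
```

Because every destination is assigned to exactly one output line, the bit maps in the outgoing copies are disjoint and together equal the incoming bit map.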


A fourth broadcast algorithm makes explicit use of the sink tree for the router initiating the 
broadcast—or any other convenient spanning tree for that matter. A spanning tree is a 
subset of the subnet that includes all the routers but contains no loops. If each router knows 
which of its lines belong to the spanning tree, it can copy an incoming broadcast packet onto 
all the spanning tree lines except the one it arrived on. This method makes excellent use of 
bandwidth, generating the absolute minimum number of packets necessary to do the job. The 
only problem is that each router must have knowledge of some spanning tree for the method 
to be applicable. Sometimes this information is available (e.g., with link state routing) but 
sometimes it is not (e.g., with distance vector routing). 


Our last broadcast algorithm is an attempt to approximate the behavior of the previous one, 
even when the routers do not know anything at all about spanning trees. The idea, called 
reverse path forwarding, is remarkably simple once it has been pointed out. When a 
broadcast packet arrives at a router, the router checks to see if the packet arrived on the line 
that is normally used for sending packets to the source of the broadcast. If so, there is an 
excellent chance that the broadcast packet itself followed the best route from the router and is 
therefore the first copy to arrive at the router. This being the case, the router forwards copies 
of it onto all lines except the one it arrived on. If, however, the broadcast packet arrived on a 
line other than the preferred one for reaching the source, the packet is discarded as a likely 
duplicate. 
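
The forwarding decision needs nothing beyond the ordinary routing table. In the sketch below, preferred_line is a hypothetical table giving, for each possible source, the line this router normally uses to send packets toward it; the function names are likewise illustrative.

```c
#include <stdbool.h>

/* True if the broadcast arrived on the line this router would itself
   use to reach the source, i.e., it is probably the first copy. */
bool rpf_should_forward(const int preferred_line[], int source,
                        int arrival_line)
{
    return preferred_line[source] == arrival_line;
}

/* Fill out_lines with every line except the arrival line and return
   the number of copies to send; returns 0 if the packet is discarded
   as a likely duplicate. out_lines must have room for n_lines - 1. */
int rpf_forward(const int preferred_line[], int source, int arrival_line,
                int n_lines, int out_lines[])
{
    int count = 0;
    if (!rpf_should_forward(preferred_line, source, arrival_line))
        return 0;
    for (int l = 0; l < n_lines; l++)
        if (l != arrival_line)
            out_lines[count++] = l;
    return count;
}
```

A router with four lines whose preferred line to the source is line 2 forwards a broadcast arriving on line 2 onto lines 0, 1, and 3, and drops a copy of the same broadcast arriving on any other line.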


An example of reverse path forwarding is shown in Fig. 5-16. Part (a) shows a subnet, part (b) 
shows a sink tree for router I of that subnet, and part (c) shows how the reverse path 
algorithm works. On the first hop, I sends packets to F, H, J, and N, as indicated by the second 
row of the tree. Each of these packets arrives on the preferred path to I (assuming that the 
preferred path falls along the sink tree) and is so indicated by a circle around the letter. On the 
second hop, eight packets are generated, two by each of the routers that received a packet on 
the first hop. As it turns out, all eight of these arrive at previously unvisited routers, and five of 
these arrive along the preferred line. Of the six packets generated on the third hop, only three 
arrive on the preferred path (at C, E, and K); the others are duplicates. After five hops and 24 
packets, the broadcasting terminates, compared with four hops and 14 packets had the sink 
tree been followed exactly. 

Figure 5-16. Reverse path forwarding. (a) A subnet. (b) A sink tree. 
(c) The tree built by reverse path forwarding. 


The principal advantage of reverse path forwarding is that it is both reasonably efficient and 
easy to implement. It does not require routers to know about spanning trees, nor does it have 
the overhead of a destination list or bit map in each broadcast packet as does multidestination 
addressing. Nor does it require any special mechanism to stop the process, as flooding does 
(either a hop counter in each packet and a priori knowledge of the subnet diameter, or a list of 
packets already seen per source). 


5.2.8 Multicast Routing 


Some applications require that widely-separated processes work together in groups, for 
example, a group of processes implementing a distributed database system. In these 
situations, it is frequently necessary for one process to send a message to all the other 
members of the group. If the group is small, it can just send each other member a 
point-to-point message. If the group is large, this strategy is expensive. Sometimes broadcasting can 
be used, but using broadcasting to inform 1000 machines on a million-node network is 
inefficient because most receivers are not interested in the message (or worse yet, they are 
definitely interested but are not supposed to see it). Thus, we need a way to send messages to 
well-defined groups that are numerically large in size but small compared to the network as a 
whole. 


Sending a message to such a group is called multicasting, and its routing algorithm is called 
multicast routing. In this section we will describe one way of doing multicast routing. For 
additional information, see (Chu et al., 2000; Costa et al., 2001; Kasera et al., 2000; Madruga 
and Garcia-Luna-Aceves, 2001; Zhang and Ryu, 2001). 


Multicasting requires group management. Some way is needed to create and destroy groups, 
and to allow processes to join and leave groups. How these tasks are accomplished is not of 
concern to the routing algorithm. What is of concern is that when a process joins a group, it 
informs its host of this fact. It is important that routers know which of their hosts belong to 
which groups. Either hosts must inform their routers about changes in group membership, or 
routers must query their hosts periodically. Either way, routers learn about which of their hosts 
are in which groups. Routers tell their neighbors, so the information propagates through the 
subnet. 


To do multicast routing, each router computes a spanning tree covering all other routers. For 
example, in Fig. 5-17(a) we have two groups, 1 and 2. Some routers are attached to hosts 
that belong to one or both of these groups, as indicated in the figure. A spanning tree for the 
leftmost router is shown in Fig. 5-17(b). 


Figure 5-17. (a) A network. (b) A spanning tree for the leftmost 
router. (c) A multicast tree for group 1. (d) A multicast tree for group 2. 


When a process sends a multicast packet to a group, the first router examines its spanning 
tree and prunes it, removing all lines that do not lead to hosts that are members of the group. 
In our example, Fig. 5-17(c) shows the pruned spanning tree for group 1. Similarly, 
Fig. 5-17(d) shows the pruned spanning tree for group 2. Multicast packets are forwarded only along 
the appropriate spanning tree. 


Various ways of pruning the spanning tree are possible. The simplest one can be used if link 
state routing is used and each router is aware of the complete topology, including which hosts 
belong to which groups. Then the spanning tree can be pruned, starting at the end of each 
path, working toward the root, and removing all routers that do not belong to the group in 
question. 
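
The leaf-to-root pruning can be sketched as a short recursive function. The tree representation (a dictionary mapping each router to its children), the router names, and the function name prune are assumptions made for this illustration.

```python
def prune(tree, root, members):
    """Return the pruned spanning tree as {router: [children kept]}.

    tree maps each router to its children in the sender's spanning tree.
    members is the set of routers that have at least one host in the group.
    A router is kept only if it is a member or leads to a member.
    """
    pruned = {}

    def visit(node):
        # Prune the subtrees first, so we work from the leaves toward the root.
        kept = [child for child in tree.get(node, []) if visit(child)]
        if kept or node in members:
            pruned[node] = kept
            return True
        return False

    visit(root)
    return pruned


# Hypothetical spanning tree rooted at R; only routers B and C have group members.
tree = {"R": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
print(prune(tree, "R", {"B", "C"}))
```

In this example D and E disappear, while A survives because it lies on the path to C.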


With distance vector routing, a different pruning strategy can be followed. The basic algorithm 
is reverse path forwarding. However, whenever a router with no hosts interested in a particular 
group and no connections to other routers receives a multicast message for that group, it 
responds with a PRUNE message, telling the sender not to send it any more multicasts for that 
group. When a router with no group members among its own hosts has received such 
messages on all its lines, it, too, can respond with a PRUNE message. In this way, the subnet 
is recursively pruned. 


One potential disadvantage of this algorithm is that it scales poorly to large networks. Suppose 
that a network has n groups, each with an average of m members. For each group, m pruned 
spanning trees must be stored, for a total of mn trees. When many large groups exist, 
considerable storage is needed to store all the trees. 


An alternative design uses core-based trees (Ballardie et al., 1993). Here, a single spanning 
tree per group is computed, with the root (the core) near the middle of the group. To send a 
multicast message, a host sends it to the core, which then does the multicast along the 
spanning tree. Although this tree will not be optimal for all sources, the reduction in storage 
costs from m trees to one tree per group is a major saving. 
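
To get a feeling for the numbers, the two storage costs can be compared directly. The figures used below (1000 groups of 50 members each) are made up for the illustration, as is the function name.

```python
def trees_stored(n_groups, members_per_group, core_based=False):
    """Number of pruned spanning trees that must be stored in total."""
    if core_based:
        return n_groups                      # one shared tree per group
    return n_groups * members_per_group      # one tree per member per group (mn)


print(trees_stored(1000, 50))                   # 50000
print(trees_stored(1000, 50, core_based=True))  # 1000
```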


5.2.9 Routing for Mobile Hosts 


Millions of people have portable computers nowadays, and they generally want to read their 
e-mail and access their normal file systems wherever in the world they may be. These mobile 
hosts introduce a new complication: to route a packet to a mobile host, the network first has 
to find it. The subject of incorporating mobile hosts into a network is very young, but in this 
section we will sketch some of the issues and give a possible solution. 


The model of the world that network designers typically use is shown in Fig. 5-18. Here we 
have a WAN consisting of routers and hosts. Connected to the WAN are LANs, MANs, and 
wireless cells of the type we studied in Chap. 2. 


Figure 5-18. A WAN to which LANs, MANs, and wireless cells are 
attached. 


Hosts that never move are said to be stationary. They are connected to the network by copper 
wires or fiber optics. In contrast, we can distinguish two other kinds of hosts. Migratory hosts 
are basically stationary hosts who move from one fixed site to another from time to time but 
use the network only when they are physically connected to it. Roaming hosts actually 
compute on the run and want to maintain their connections as they move around. We will use 
the term mobile hosts to mean either of the latter two categories, that is, all hosts that are 
away from home and still want to be connected. 


All hosts are assumed to have a permanent home location that never changes. Hosts also 
have a permanent home address that can be used to determine their home locations, 
analogous to the way the telephone number 1-212-5551212 indicates the United States 
(country code 1) and Manhattan (212). The routing goal in systems with mobile hosts is to 
make it possible to send packets to mobile hosts using their home addresses and have the 
packets efficiently reach them wherever they may be. The trick, of course, is to find them. 


In the model of Fig. 5-18, the world is divided up (geographically) into small units. Let us call 
them areas, where an area is typically a LAN or wireless cell. Each area has one or more 
foreign agents, which are processes that keep track of all mobile hosts visiting the area. In 
addition, each area has a home agent, which keeps track of hosts whose home is in the area, 
but who are currently visiting another area. 


When a new host enters an area, either by connecting to it (e.g., plugging into the LAN) or just 
wandering into the cell, his computer must register itself with the foreign agent there. The 
registration procedure typically works like this: 


1. Periodically, each foreign agent broadcasts a packet announcing its existence and 
address. A newly-arrived mobile host may wait for one of these messages, but if none 
arrives quickly enough, the mobile host can broadcast a packet saying: Are there any 
foreign agents around? 

2. The mobile host registers with the foreign agent, giving its home address, current data 
link layer address, and some security information. 


3. The foreign agent contacts the mobile host's home agent and says: One of your hosts is 
over here. The message from the foreign agent to the home agent contains the foreign 
agent's network address. It also includes the security information to convince the home 
agent that the mobile host is really there. 

4. The home agent examines the security information, which contains a timestamp, to 
prove that it was generated within the past few seconds. If it is happy, it tells the 
foreign agent to proceed. 

5. When the foreign agent gets the acknowledgement from the home agent, it makes an 
entry in its tables and informs the mobile host that it is now registered. 


Ideally, when a host leaves an area, that, too, should be announced to allow deregistration, 
but many users abruptly turn off their computers when done. 


When a packet is sent to a mobile host, it is routed to the host's home LAN because that is 
what the address says should be done, as illustrated in step 1 of Fig. 5-19. Here the sender, in 
the northwest city of Seattle, wants to send a packet to a host normally across the United 
States in New York. Packets sent to the mobile host on its home LAN in New York are 
intercepted by the home agent there. The home agent then looks up the mobile host's new 
(temporary) location and finds the address of the foreign agent handling the mobile host, in 
Los Angeles. 


Figure 5-19. Packet routing for mobile hosts. 


The home agent then does two things. First, it encapsulates the packet in the payload field of 
an outer packet and sends the latter to the foreign agent (step 2 in Fig. 5-19). This mechanism 
is called tunneling; we will look at it in more detail later. After getting the encapsulated packet, 
the foreign agent removes the original packet from the payload field and sends it to the mobile 
host as a data link frame. 


Second, the home agent tells the sender to henceforth send packets to the mobile host by 
encapsulating them in the payload of packets explicitly addressed to the foreign agent instead 
of just sending them to the mobile host's home address (step 3). Subsequent packets can now 
be routed directly to the host via the foreign agent (step 4), bypassing the home location 
entirely. 
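
Steps 1 and 2 can be sketched as follows. This is a toy model, not a real mobile host protocol: the Packet and HomeAgent classes, the function foreign_agent_deliver, and all the addresses are hypothetical names chosen for the illustration, and the security check made during registration is left out.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: str
    dst: str
    payload: object


class HomeAgent:
    def __init__(self, address):
        self.address = address
        self.visiting = {}  # home address -> address of current foreign agent

    def register(self, home_address, foreign_agent):
        """Record that one of our hosts is now visiting a foreign agent."""
        self.visiting[home_address] = foreign_agent

    def intercept(self, packet):
        """Tunnel a packet addressed to a host that is away from home."""
        foreign_agent = self.visiting.get(packet.dst)
        if foreign_agent is None:
            return packet  # the host is at home: deliver normally
        # Encapsulate the whole original packet in the payload of an outer one.
        return Packet(self.address, foreign_agent, packet)


def foreign_agent_deliver(outer):
    """Remove the original packet from the payload of the outer packet."""
    return outer.payload


home_agent = HomeAgent("agent.newyork")
home_agent.register("mobile.newyork", "agent.losangeles")
original = Packet("sender.seattle", "mobile.newyork", "hello")
outer = home_agent.intercept(original)
print(outer.dst)                               # agent.losangeles
print(foreign_agent_deliver(outer) == original)  # True
```

The original packet, home address and all, travels unchanged inside the outer packet, so the mobile host sees exactly what the sender sent.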


The various schemes that have been proposed differ in several ways. First, there is the issue of 
how much of this protocol is carried out by the routers and how much by the hosts, and in the 
latter case, by which layer in the hosts. Second, in a few schemes, routers along the way 
record mapped addresses so they can intercept and redirect traffic even before it gets to the 
home location. Third, in some schemes each visitor is given a unique temporary address; in 
others, the temporary address refers to an agent that handles traffic for all visitors. 


Fourth, the schemes differ in how they actually manage to arrange for packets that are 
addressed to one destination to be delivered to a different one. One choice is changing the 
destination address and just retransmitting the modified packet. Alternatively, the whole 
packet, home address and all, can be encapsulated inside the payload of another packet sent 
to the temporary address. Finally, the schemes differ in their security aspects. In general, 
when a host or router gets a message of the form "Starting right now, please send all of 
Stephany's mail to me," it might have a couple of questions about whom it was talking to and 
whether this is a good idea. Several mobile host protocols are discussed and compared in (Hac 
and Guo, 2000; Perkins, 1998a; Snoeren and Balakrishnan, 2000; Solomon, 1998; and Wang 
and Chen, 2001). 


5.2.10 Routing in Ad Hoc Networks 


We have now seen how to do routing when the hosts are mobile but the routers are fixed. An 
even more extreme case is one in which the routers themselves are mobile. Among the 
possibilities are: 


1. Military vehicles on a battlefield with no existing infrastructure. 

2. A fleet of ships at sea. 

3. Emergency workers at an earthquake that destroyed the infrastructure. 

4. A gathering of people with notebook computers in an area lacking 802.11. 


In all these cases, and others, each node consists of a router and a host, usually on the same 
computer. Networks of nodes that just happen to be near each other are called ad hoc 
networks or MANETs (Mobile Ad hoc NETworks). Let us now examine them briefly. More 
information can be found in (Perkins, 2001). 


What makes ad hoc networks different from wired networks is that all the usual rules about 
fixed topologies, fixed and known neighbors, fixed relationship between IP address and 
location, and more are suddenly tossed out the window. Routers can come and go or appear in 
new places at the drop of a bit. With a wired network, if a router has a valid path to some 
destination, that path continues to be valid indefinitely (barring a failure somewhere in the 
system). With an ad hoc network, the topology may be changing all the time, so desirability 
and even validity of paths can change spontaneously, without warning. Needless to say, these 
circumstances make routing in ad hoc networks quite different from routing in their fixed 
counterparts. 


A variety of routing algorithms for ad hoc networks have been proposed. One of the more 
interesting ones is the AODV (Ad hoc On-demand Distance Vector) routing algorithm 
(Perkins and Royer, 1999). It is a distant relative of the Bellman-Ford distance vector 
algorithm but adapted to work in a mobile environment and takes into account the limited 
bandwidth and low battery life found in this environment. Another unusual characteristic is that 
it is an on-demand algorithm, that is, it determines a route to some destination only when 
somebody wants to send a packet to that destination. Let us now see what that means. 


Route Discovery 


At any instant of time, an ad hoc network can be described by a graph of the nodes (routers + 
hosts). Two nodes are connected (i.e., have an arc between them in the graph) if they can 
communicate directly using their radios. Since one of the two may have a more powerful 
transmitter than the other, it is possible that A is connected to B but B is not connected to A. 
However, for simplicity, we will assume all connections are symmetric. It should also be noted 
that the mere fact that two nodes are within radio range of each other does not mean that 
they are connected. There may be buildings, hills, or other obstacles that block their 
communication. 


To describe the algorithm, consider the ad hoc network of Fig. 5-20, in which a process at 
node A wants to send a packet to node I. The AODV algorithm maintains a table at each node, 
keyed by destination, giving information about that destination, including which neighbor to 
send packets to in order to reach the destination. Suppose that A looks in its table and does 
not find an entry for I. It now has to discover a route to I. This property of discovering routes 
only when they are needed is what makes this algorithm "on demand." 


Figure 5-20. (a) Range of A's broadcast. (b) After B and D have 
received A's broadcast. (c) After C, F, and G have received A's 
broadcast. (d) After E, H, and I have received A's broadcast. The 
shaded nodes are new recipients. The arrows show the possible 
reverse routes. 


To locate I, A constructs a special ROUTE REQUEST packet and broadcasts it. The packet 
reaches B and D, as illustrated in Fig. 5-20(a). In fact, the reason B and D are connected to A 
in the graph is that they can receive communication from A. F, for example, is not shown with 
an arc to A because it cannot receive A's radio signal. Thus, F is not connected to A. 


The format of the ROUTE REQUEST packet is shown in Fig. 5-21. It contains the source and 
destination addresses, typically their IP addresses, which identify who is looking for whom. It 
also contains a Request ID, which is a local counter maintained separately by each node and 
incremented each time a ROUTE REQUEST is broadcast. Together, the Source address and 
Request ID fields uniquely identify the ROUTE REQUEST packet to allow nodes to discard any 
duplicates they may receive. 


Figure 5-21. Format of a ROUTE REQUEST packet. 
Source address | Request ID | Destination address | Source sequence # | Dest. sequence # | Hop count 


In addition to the Request ID counter, each node also maintains a second sequence counter 
incremented whenever a ROUTE REQUEST is sent (or a reply to someone else's ROUTE 
REQUEST). It functions a little bit like a clock and is used to tell new routes from old routes. 
The fourth field of Fig. 5-21 is A's sequence counter; the fifth field is the most recent value of 
I's sequence number that A has seen (0 if it has never seen it). The use of these fields will 
become clear shortly. The final field, Hop count, will keep track of how many hops the packet 
has made. It is initialized to 0. 


When a ROUTE REQUEST packet arrives at a node (B and D in this case), it is processed in the 
following steps. 


1. The (Source address, Request ID) pair is looked up in a local history table to see if this 
request has already been seen and processed. If it is a duplicate, it is discarded and 
processing stops. If it is not a duplicate, the pair is entered into the history table so 
future duplicates can be rejected, and processing continues. 

2. The receiver looks up the destination in its route table. If a fresh route to the 
destination is known, a ROUTE REPLY packet is sent back to the source telling it how to 
get to the destination (basically: Use me). Fresh means that the Destination sequence 
number stored in the routing table is greater than or equal to the Destination sequence 
number in the ROUTE REQUEST packet. If it is less, the stored route is older than the 
previous route the source had for the destination, so step 3 is executed. 

3. Since the receiver does not know a fresh route to the destination, it increments the Hop 
count field and rebroadcasts the ROUTE REQUEST packet. It also extracts the data from 
the packet and stores it as a new entry in its reverse route table. This information will 
be used to construct the reverse route so that the reply can get back to the source 
later. The arrows in Fig. 5-20 are used for building the reverse route. A timer is also 
started for the newly-made reverse route entry. If it expires, the entry is deleted. 
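
The three steps can be sketched in Python as follows. This is a simplified illustration rather than an AODV implementation: the class and field names are assumptions, the reverse-route timer is omitted, and the special case in which the receiving node is itself the destination is not handled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RouteRequest:
    source: str
    request_id: int
    destination: str
    source_seq: int
    dest_seq: int
    hop_count: int = 0


class Node:
    def __init__(self, name):
        self.name = name
        self.history = set()      # (source, request ID) pairs already seen
        self.routes = {}          # destination -> (next hop, destination sequence #)
        self.reverse_routes = {}  # (source, request ID) -> neighbor it came from

    def handle_request(self, req, from_neighbor):
        key = (req.source, req.request_id)
        # Step 1: discard duplicates, remember everything else.
        if key in self.history:
            return ("discard", None)
        self.history.add(key)
        # Step 2: reply if we know a fresh route to the destination.
        route = self.routes.get(req.destination)
        if route is not None and route[1] >= req.dest_seq:
            return ("reply", route)
        # Step 3: record the reverse route, bump the hop count, rebroadcast.
        self.reverse_routes[key] = from_neighbor
        return ("rebroadcast", replace(req, hop_count=req.hop_count + 1))


b = Node("B")
request = RouteRequest("A", 1, "I", source_seq=5, dest_seq=0)
print(b.handle_request(request, "A")[0])  # rebroadcast
print(b.handle_request(request, "D")[0])  # discard (same request arriving via D)
```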


Neither B nor D knows where I is, so each of them creates a reverse route entry pointing back 
to A, as shown by the arrows in Fig. 5-20, and broadcasts the packet with Hop count set to 1. 
The broadcast from B reaches C and D. C makes an entry for it in its reverse route table and 
rebroadcasts it. In contrast, D rejects it as a duplicate. Similarly, D's broadcast is rejected by 
B. However, D's broadcast is accepted by F and G and stored, as shown in Fig. 5-20(c). After 
E, H, and I receive the broadcast, the ROUTE REQUEST finally reaches a destination that 
knows where I is, namely, I itself, as illustrated in Fig. 5-20(d). Note that although we have 
shown the broadcasts in three discrete steps here, the broadcasts from different nodes are not 
coordinated in any way. 


In response to the incoming request, I builds a ROUTE REPLY packet, as shown in Fig. 5-22. 
The Source address and Destination address are copied from the incoming request, but the 
Destination sequence number is taken from its counter in memory. The Hop 
count field is set to 0. The Lifetime field controls how long the route is valid. This packet is 
unicast to the node that the ROUTE REQUEST packet came from, in this case, G. It then 
follows the reverse path to D and finally to A. At each node, Hop count is incremented so the 
node can see how far from the destination (I) it is. 


Figure 5-22. Format of a ROUTE REPLY packet. 


Source address | Destination address | Destination sequence # | Hop count | Lifetime 


At each intermediate node on the way back, the packet is inspected. It is entered into the local 
routing table as a route to I if one or more of the following three conditions are met: 


1. No route to I is known. 

2. The sequence number for I in the ROUTE REPLY packet is greater than the value in the 
routing table. 

3. The sequence numbers are equal but the new route is shorter. 
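
The three conditions amount to a small predicate. In the sketch below, the function name should_install and the representation of a routing table entry as a (sequence number, hop count) pair are assumptions made for the illustration.

```python
def should_install(current, reply_seq, reply_hops):
    """Decide whether a ROUTE REPLY should replace the current table entry.

    current is None if no route to the destination is known; otherwise it is
    a (destination sequence number, hop count) pair from the routing table.
    """
    if current is None:
        return True                      # condition 1: no route known
    seq, hops = current
    if reply_seq > seq:
        return True                      # condition 2: reply is newer
    return reply_seq == seq and reply_hops < hops  # condition 3: same age, shorter


print(should_install(None, 7, 3))    # True
print(should_install((7, 4), 7, 3))  # True: equally fresh but shorter
print(should_install((8, 4), 7, 1))  # False: the table entry is newer
```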


In this way, all the nodes on the reverse route learn the route to I for free, as a byproduct of 
A's route discovery. Nodes that got the original ROUTE REQUEST packet but were not on the 
reverse path (B, C, E, F, and H in this example) discard the reverse route table entry when the 
associated timer expires. 


In a large network, the algorithm generates many broadcasts, even for destinations that are 
close by. The number of broadcasts can be reduced as follows. The IP packet's Time to live is 
initialized by the sender to the expected diameter of the network and decremented on each 
hop. If it hits 0, the packet is discarded instead of being broadcast. 


The discovery process is then modified as follows. To locate a destination, the sender 
broadcasts a ROUTE REQUEST packet with Time to live set to 1. If no response comes back 
within a reasonable time, another one is sent, this time with Time to live set to 2. Subsequent 
attempts use 3, 4, 5, etc. In this way, the search is first attempted locally, then in increasingly 
wider rings. 
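
The widening search can be sketched as a simple loop. The callback send_request is hypothetical: it stands for broadcasting a ROUTE REQUEST with the given Time to live and waiting a reasonable time for a reply, returning None on a timeout. The upper bound max_ttl is likewise an assumption, added so the loop terminates.

```python
def expanding_ring_search(send_request, max_ttl):
    """Try Time to live = 1, 2, 3, ... until a reply arrives or max_ttl is hit."""
    for ttl in range(1, max_ttl + 1):
        reply = send_request(ttl)
        if reply is not None:
            return ttl, reply
    return None  # destination not found within max_ttl hops


# Simulated network in which the destination is three hops away.
def simulated_request(ttl):
    return "route to I" if ttl >= 3 else None


print(expanding_ring_search(simulated_request, max_ttl=8))  # (3, 'route to I')
```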


Route Maintenance 


Because nodes can move or be switched off, the topology can change spontaneously. For 
example, in Fig. 5-20, if G is switched off, A will not realize that the route it was using to I 
(ADGI) is no longer valid. The algorithm needs to be able to deal with this. Periodically, each 
node broadcasts a Hello message. Each of its neighbors is expected to respond to it. If no 
response is forthcoming, the broadcaster knows that that neighbor has moved out of range 
and is no longer connected to it. Similarly, if it tries to send a packet to a neighbor that does 
not respond, it learns that the neighbor is no longer available. 


This information is used to purge routes that no longer work. For each possible destination, 
each node, N, keeps track of its neighbors that have fed it a packet for that destination during 
the last ΔT seconds. These are called N's active neighbors for that destination. N does this by 
having a routing table keyed by destination and containing the outgoing node to use to reach 
the destination, the hop count to the destination, the most recent destination sequence 
number, and the list of active neighbors for that destination. A possible routing table for node 
D in our example topology is shown in Fig. 5-23(a). 


Figure 5-23. (a) D's routing table before G goes down. (b) The graph 
after G has gone down. 


Dest. | Next hop | Distance | Active neighbors | Other fields 


When any of N's neighbors becomes unreachable, it checks its routing table to see which 
destinations have routes using the now-gone neighbor. For each of these routes, the active 
neighbors are informed that their route via N is now invalid and must be purged from their 
routing tables. The active neighbors then tell their active neighbors, and so on, recursively, 
until all routes depending on the now-gone node are purged from all routing tables. 


As an example of route maintenance, consider our previous example, but now with G suddenly 
switched off. The changed topology is illustrated in Fig. 5-23(b). When D discovers that G is 
gone, it looks at its routing table and sees that G was used on routes to E, G, and I. The union 
of the active neighbors for these destinations is the set {A, B}. In other words, A and B 
depend on G for some of their routes, so they have to be informed that these routes no longer 
work. D tells them by sending them packets that cause them to update their own routing 
tables accordingly. D also purges the entries for E, G, and I from its routing table. 
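
What D does when it loses a neighbor can be sketched as follows. The table layout, the function name neighbor_lost, and in particular the table contents are assumptions: the entries below are illustrative values chosen to agree with the text, not a copy of the figure.

```python
def neighbor_lost(routing_table, lost):
    """Purge all routes through a lost neighbor.

    routing_table maps destination -> {"next_hop": ..., "active": set of
    active neighbors}. Returns the purged destinations and the set of
    active neighbors that must be told their routes are no longer valid.
    """
    purged = [dest for dest, entry in routing_table.items()
              if entry["next_hop"] == lost]
    notify = set()
    for dest in purged:
        notify |= routing_table.pop(dest)["active"]
    return purged, notify


# Illustrative routing table for node D (values are assumptions).
table_d = {
    "A": {"next_hop": "A", "active": {"F", "G"}},
    "E": {"next_hop": "G", "active": set()},
    "F": {"next_hop": "F", "active": {"A", "B"}},
    "G": {"next_hop": "G", "active": {"A", "B"}},
    "I": {"next_hop": "G", "active": {"A", "B"}},
}
purged, notify = neighbor_lost(table_d, "G")
print(purged)          # ['E', 'G', 'I']
print(sorted(notify))  # ['A', 'B']
```

Each notified neighbor would then apply the same procedure to its own table, which gives the recursive purge described above.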


It may not have been obvious from our description, but a critical difference between AODV and 
Bellman-Ford is that nodes do not send out periodic broadcasts containing their entire routing 
table. This difference saves both bandwidth and battery life. 


AODV is also capable of doing broadcast and multicast routing. For details, consult (Perkins 
and Royer, 2001). Ad hoc routing is a red-hot research area. A great deal has been published 
on the topic. A few of the papers include (Chen et al., 2002; Hu and Johnson, 2001; Li et al., 
2001; Raju and Garcia-Luna-Aceves, 2001; Ramanathan and Redi, 2002; Royer and Toh, 
1999; Spohn and Garcia-Luna-Aceves, 2001; Tseng et al., 2001; and Zadeh et al., 2002). 


5.2.11 Node Lookup in Peer-to-Peer Networks 


A relatively new phenomenon is peer-to-peer networks, in which a large number of people, 
usually with permanent wired connections to the Internet, are in contact to share resources. 
The first widespread application of peer-to-peer technology was for mass crime: 50 million 
Napster users were exchanging copyrighted songs without the copyright owners' permission 
until Napster was shut down by the courts amid great controversy. Nevertheless, peer-to-peer 
technology has many interesting and legal uses. It also has something similar to a routing 
problem, although it is not quite the same as the ones we have studied so far. Nevertheless, it 
is worth a quick look. 


What makes peer-to-peer systems interesting is that they are totally distributed. All nodes are 
symmetric and there is no central control or hierarchy. In a typical peer-to-peer system the 
users each have some information that may be of interest to other users. This information may 
be free software, (public domain) music, photographs, and so on. If there are large numbers of 
users, they will not know each other and will not know where to find what they are looking for. 
One solution is a big central database, but this may not be feasible for some reason (e.g., 
nobody is willing to host and maintain it). Thus, the problem comes down to how a user finds a 
node that contains what he is looking for in the absence of a centralized database or even a 
centralized index. 


Let us assume that each user has one or more data items such as songs, photographs, 
programs, files, and so on that other users might want to read. Each item has an ASCII string 
naming it. A potential user knows just the ASCII string and wants to find out if one or more 
people have copies and, if so, what their IP addresses are. 


As an example, consider a distributed genealogical database. Each genealogist has some 
on-line records for his or her ancestors and relatives, possibly with photos, audio, or even video 
clips of the person. Multiple people may have the same great grandfather, so an ancestor may 
have records at multiple nodes. The name of the record is the person's name in some 
canonical form. At some point, a genealogist discovers his great grandfather's will in an 
archive, in which the great grandfather bequeaths his gold pocket watch to his nephew. The 
genealogist now knows the nephew's name and wants to find out if any other genealogist has 
a record for him. How, without a central database, do we find out who, if anyone, has records? 


Various algorithms have been proposed to solve this problem. The one we will examine is 
Chord (Dabek et al., 2001a; and Stoica et al., 2001). A simplified explanation of how it works 
is as follows. The Chord system consists of n participating users, each of whom may have 
some stored records and each of whom is prepared to store bits and pieces of the index for 
use by other users. Each user node has an IP address that can be hashed to an m-bit number 
using a hash function, hash. Chord uses SHA-1 for hash. SHA-1 is used in cryptography; we 
will look at it in Chap. 8. For now, it is just a function that takes a variable-length byte string 
as argument and produces a highly-random 160-bit number. Thus, we can convert any IP 
address to a 160-bit number called the node identifier. 


Conceptually, all the 2^160 node identifiers are arranged in ascending order in a big circle. Some 
of them correspond to participating nodes, but most of them do not. In Fig. 5-24(a) we show 
the node identifier circle for m = 5 (just ignore the arcs in the middle for the moment). In this 
example, the nodes with identifiers 1, 4, 7, 12, 15, 20, and 27 correspond to actual nodes and 
are shaded in the figure; the rest do not exist. 


Figure 5-24. (a) A set of 32 node identifiers arranged in a circle. The 
shaded ones correspond to actual machines. The arcs show the fingers 
from nodes 1, 4, and 12. The labels on the arcs are the table indices. 
(b) Examples of the finger tables. 


Let us now define the function successor(k) as the node identifier of the first actual node 
following k around the circle clockwise. For example, successor(6) = 7, successor(8) = 12, 
and successor(22) = 27. 
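
For the m = 5 example, successor can be computed with a binary search over the sorted list of actual nodes. This is a sketch for the small example only; the names NODES and successor are ours, and we assume, as in Chord, that an identifier belonging to an actual node is its own successor.

```python
from bisect import bisect_left

NODES = [1, 4, 7, 12, 15, 20, 27]  # actual nodes in the m = 5 example


def successor(k, nodes=NODES):
    """First actual node at or after identifier k, going clockwise."""
    i = bisect_left(nodes, k)
    return nodes[i % len(nodes)]  # wrap around the circle past the last node


print(successor(6), successor(8), successor(22))  # 7 12 27
print(successor(28))                              # 1 (wraps around)
```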


The names of the records (song names, ancestors' names, and so on) are also hashed with 
hash (i.e., SHA-1) to generate a 160-bit number, called the key. Thus, to convert name (the 
ASCII name of the record) to its key, we use key = hash(name). This computation is just a 
local procedure call to hash. If a person holding a genealogical record for name wants to make 
it available to everyone, he first builds a tuple consisting of (name, my-IP-address) and then 
asks successor(hash(name)) to store the tuple. If multiple records (at different nodes) exist for 
this name, their tuples will all be stored at the same node. In this way, the index is distributed 
over the nodes at random. For fault tolerance, p different hash functions could be used to store 
each tuple at p nodes, but we will not consider that further here. 


If some user later wants to look up name, he hashes it to get key and then uses successor 
(key) to find the IP address of the node storing its index tuples. The first step is easy; the 
second one is not. To make it possible to find the IP address of the node corresponding to a 
certain key, each node must maintain certain administrative data structures. One of these is 
the IP address of its successor node along the node identifier circle. For example, in Fig. 5-24, 
node 4's successor is 7 and node 7's successor is 12. 


Lookup can now proceed as follows. The requesting node sends a packet to its successor 
containing its IP address and the key it is looking for. The packet is propagated around the ring 
until it locates the successor to the node identifier being sought. That node checks to see if it 
has any information matching the key, and if so, returns it directly to the requesting node, 
whose IP address it has. 


As a first optimization, each node could hold the IP addresses of both its successor and its 
predecessor, so that queries could be sent either clockwise or counterclockwise, depending on 
which path is thought to be shorter. For example, node 7 in Fig. 5-24 could go clockwise to 
find node identifier 10 but counterclockwise to find node identifier 3. 


Even with two choices of direction, linearly searching all the nodes is very inefficient in a large 
peer-to-peer system since the mean number of nodes required per search is n/2. To greatly 
speed up the search, each node also maintains what Chord calls a finger table. The finger 
table has m entries, indexed by 0 through m - 1, each one pointing to a different actual node. 
Each of the entries has two fields: start and the IP address of successor(start), as shown for 
three example nodes in Fig. 5-24(b). The values of the fields for entry i at node k are: 


start = k + 2^i (modulo 2^m) 
IP address of successor(start[i]) 


Note that each node stores the IP addresses of a relatively small number of nodes and that 
most of these are fairly close by in terms of node identifier. 
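
As a small sketch of the definition above, the following Python builds the finger table for a node. It assumes m = 5 and the node identifiers implied by the examples in the text; node identifiers stand in for the IP addresses a real table would hold, and the function names are ours.

```python
from bisect import bisect_left

M = 5                                  # identifier bits (assumed from the examples)
NODES = [1, 4, 7, 12, 15, 20, 27]      # assumed node identifiers, kept sorted


def successor(ident: int) -> int:
    """First actual node at or after ident on the identifier circle."""
    i = bisect_left(NODES, ident % (2 ** M))
    return NODES[i % len(NODES)]


def finger_table(k: int):
    """Return the M entries (start, successor(start)) for node k."""
    table = []
    for i in range(M):
        start = (k + 2 ** i) % (2 ** M)
        table.append((start, successor(start)))
    return table
```

For node 1 the start values are 2, 3, 5, 9, and 17, so most entries point to nearby nodes and only the last ones reach far around the circle.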


Using the finger table, the lookup of key at node k proceeds as follows. If key falls between k 
and successor (k), then the node holding information about key is successor (k) and the 
search terminates. Otherwise, the finger table is searched to find the entry whose start field is 
the closest predecessor of key. A request is then sent directly to the IP address in that finger 
table entry to ask it to continue the search. Since it is closer to key but still below it, chances 
are good that it will be able to return the answer with only a small number of additional 
queries. In fact, since every lookup halves the remaining distance to the target, it can be 
shown that the average number of lookups is log₂ n. 


As a first example, consider looking up key = 3 at node 1. Since node 1 knows that 3 lies 
between it and its successor, 4, the desired node is 4 and the search terminates, returning 
node 4's IP address. 


As a second example, consider looking up key = 14 at node 1. Since 14 does not lie between 1 
and 4, the finger table is consulted. The closest predecessor to 14 is 9, so the request is 
forwarded to the IP address of 9's entry, namely, that of node 12. Node 12 sees that 14 falls 
between it and its successor (15), so it returns the IP address of node 15. 


As a third example, consider looking up key = 16 at node 1. Again a query is sent to node 12, 
but this time node 12 does not know the answer itself. It looks for the node most closely 
preceding 16 and finds 14, which yields the IP address of node 15. A query is then sent there. 
Node 15 observes that 16 lies between it and its successor (20), so it returns the IP address of 
20 to the caller, which works its way back to node 1. 
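
The three examples can be replayed with a short simulation. This is a sketch, not Chord's actual code: all nodes live in one process, node identifiers stand in for IP addresses, and the node set is the one implied by the examples. One deliberate deviation is labeled in the code: the next hop is chosen by comparing the finger's node (rather than its start field) with the key, as in the original Chord design, which gives the same hops for the examples here but guarantees that a query never jumps past the key.

```python
from bisect import bisect_left

M = 5
SIZE = 2 ** M
NODES = [1, 4, 7, 12, 15, 20, 27]      # assumed node identifiers, kept sorted


def successor(ident: int) -> int:
    i = bisect_left(NODES, ident % SIZE)
    return NODES[i % len(NODES)]


def fingers(k: int):
    """Nodes pointed to by the finger table of node k."""
    return [successor(k + 2 ** i) for i in range(M)]


def dist(a: int, b: int) -> int:
    """Clockwise distance from a to b on the identifier circle."""
    return (b - a) % SIZE


def lookup(k: int, key: int, path=None):
    """Return (node holding key, list of nodes that handled the query)."""
    path = (path or []) + [k]
    if key % SIZE == k:                # a node is responsible for its own identifier
        return k, path
    succ = successor(k + 1)
    if dist(k, key) <= dist(k, succ):  # key lies between k and successor(k)
        return succ, path
    # Assumption: pick the finger node that most closely precedes the key.
    candidates = [f for f in fingers(k) if 0 < dist(k, f) < dist(k, key)]
    nxt = max(candidates, key=lambda f: dist(k, f))
    return lookup(nxt, key, path)
```

Calling lookup(1, 16) visits nodes 1, 12, and 15 before returning node 20, matching the third example.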


Since nodes join and leave all the time, Chord needs a way to handle these operations. We 
assume that when the system began operation it was small enough that the nodes could just 
exchange information directly to build the first circle and finger tables. After that an automated 
procedure is needed, as follows. When a new node, r, wants to join, it must contact some 
existing node and ask it to look up the IP address of successor (r) for it. The new node then 
asks successor (r) for its predecessor. The new node then asks both of these to insert r in 
between them in the circle. For example, if 24 in Fig. 5-24 wants to join, it asks any node to 
look up successor (24), which is 27. Then it asks 27 for its predecessor (20). After it tells both 
of those about its existence, 20 uses 24 as its successor and 27 uses 24 as its predecessor. In 
addition, node 27 hands over those keys in the range 21-24, which now belong to 24. At this 
point, 24 is fully inserted. 
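
The join steps can be sketched as pointer updates on two dictionaries. Everything here is a simplification for illustration: the ring is held in one process, successor(r) is found by scanning rather than by a real lookup, and the names build_ring, in_range, and join are hypothetical.

```python
SIZE = 32                               # 2**5 identifiers, as in the example


def build_ring(nodes):
    """Return (succ, pred) dictionaries for a sorted circle of nodes."""
    nodes = sorted(nodes)
    succ = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}
    pred = {s: n for n, s in succ.items()}
    return succ, pred


def in_range(x, lo, hi):
    """True if x lies in the circular interval (lo, hi]."""
    return 0 < (x - lo) % SIZE <= (hi - lo) % SIZE


def join(r, succ, pred, keys):
    # Step 1: find successor(r); a real node asks an existing node to look it up.
    s = next(n for n in succ if in_range(r, pred[n], n))
    # Step 2: ask successor(r) for its predecessor.
    p = pred[s]
    # Step 3: both neighbors insert r in between them.
    succ[p], pred[r] = r, p
    succ[r], pred[s] = s, r
    # Step 4: s hands over the keys in (p, r], which now belong to r.
    keys[r] = {k for k in keys[s] if in_range(k, p, r)}
    keys[s] -= keys[r]
    return p, s
```

With the ring from the examples, join(24, ...) returns (20, 27), and any keys in the range 21-24 held by node 27 move to node 24.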


However, many finger tables are now wrong. To correct them, every node runs a background 
process that periodically recomputes each finger by calling successor. When one of these 
queries hits a new node, the corresponding finger entry is updated. 


When a node leaves gracefully, it hands its keys over to its successor and informs its 
predecessor of its departure so the predecessor can link to the departing node's successor. 
When a node crashes, a problem arises because its predecessor no longer has a valid 
successor. To alleviate this problem, each node keeps track not only of its direct successor but 
also its s direct successors, to allow it to skip over up to s - 1 consecutive failed nodes and 
reconnect the circle. 
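
A minimal sketch of the successor-list idea follows. It assumes each node simply stores the identifiers of its next s nodes and that liveness is known from a set called alive; in a real system a node would discover failures through timeouts. The function names are ours.

```python
def successor_lists(nodes, s):
    """For each node, the list of its s direct successors in ring order."""
    nodes = sorted(nodes)
    n = len(nodes)
    return {node: [nodes[(i + j) % n] for j in range(1, s + 1)]
            for i, node in enumerate(nodes)}


def next_live_successor(node, lists, alive):
    """First entry in node's successor list that is still alive, else None."""
    for candidate in lists[node]:
        if candidate in alive:
            return candidate
    return None                         # all s successors failed: circle is broken
```

With s = 3, node 12 keeps the list 15, 20, 27. If 15 and 20 both crash, it can still reconnect to 27, that is, it survives s - 1 = 2 consecutive failures.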


Chord has been used to construct a distributed file system (Dabek et al., 2001b) and other 
applications, and research is ongoing. A different peer-to-peer system, Pastry, and its 
applications are described in (Rowstron and Druschel, 2001a; and Rowstron and Druschel, 
2001b). A third peer-to-peer system, Freenet, is discussed in (Clarke et al., 2002). A fourth 
system of this type is described in (Ratnasamy et al., 2001). 


5.3 The Network Layer in the Internet 


Before getting into the specifics of the network layer in the Internet, it is worth taking a look 
at the principles that drove its design in the past and made it the success that it is today. All 
too often, nowadays, people seem to have forgotten them. These principles are enumerated 
and discussed in RFC 1958, which is well worth reading (and should be mandatory for all 
protocol designers—with a final exam at the end). This RFC draws heavily on ideas found in 
(Clark, 1988; and Saltzer et al., 1984). We will now summarize what we consider to be the top 
10 principles (from most important to least important). 


1. Make sure it works. Do not finalize the design or standard until multiple prototypes 
have successfully communicated with each other. All too often designers first write a 
1000-page standard, get it approved, then discover it is deeply flawed and does not 
work. Then they write version 1.1 of the standard. This is not the way to go. 

2. Keep it simple. When in doubt, use the simplest solution. William of Occam stated this 
principle (Occam's razor) in the 14th century. Put in modern terms: fight features. If a 
feature is not absolutely essential, leave it out, especially if the same effect can be 
achieved by combining other features. 

3. Make clear choices. If there are several ways of doing the same thing, choose one. 
Having two or more ways to do the same thing is looking for trouble. Standards often 
have multiple options or modes or parameters because several powerful parties insist 
that their way is best. Designers should strongly resist this tendency. Just say no. 

4. Exploit modularity. This principle leads directly to the idea of having protocol stacks, 
each of whose layers is independent of all the other ones. In this way, if circumstances 
require one module or layer to be changed, the other ones will not be affected. 

5. Expect heterogeneity. Different types of hardware, transmission facilities, and 
applications will occur on any large network. To handle them, the network design must 
be simple, general, and flexible. 

6. Avoid static options and parameters. If parameters are unavoidable (e.g., 
maximum packet size), it is better to have the sender and receiver negotiate a value than 
to define fixed choices. 

7. Look for a good design; it need not be perfect. Often the designers have a good 
design but it cannot handle some weird special case. Rather than messing up the 
design, the designers should go with the good design and put the burden of working 
around it on the people with the strange requirements. 

8. Be strict when sending and tolerant when receiving. In other words, only send 
packets that rigorously comply with the standards, but expect incoming packets that 
may not be fully conformant and try to deal with them. 

9. Think about scalability. If the system is to handle millions of hosts and billions of 
users effectively, no centralized databases of any kind are tolerable and load must be 
spread as evenly as possible over the available resources. 

10. Consider performance and cost. If a network has poor performance or outrageous 
costs, nobody will use it. 


Let us now leave the general principles and start looking at the details of the Internet's 
network layer. At the network layer, the Internet can be viewed as a collection of subnetworks 
or Autonomous Systems (ASes) that are interconnected. There is no real structure, but 
several major backbones exist. These are constructed from high-bandwidth lines and fast 
routers. Attached to the backbones are regional (midlevel) networks, and attached to these 
regional networks are the LANs at many universities, companies, and Internet service 
providers. A sketch of this quasi-hierarchical organization is given in Fig. 5-52. 


Figure 5-52. The Internet is an interconnected collection of many networks. 

[Figure not reproduced. Labels: leased lines, a U.S. backbone, leased transatlantic line, a European backbone, regional network, national network, tunnel, host, IP Ethernet LAN, IP token ring LAN.] 


The glue that holds the whole Internet together is the network layer protocol, IP (Internet 
Protocol). Unlike most older network layer protocols, it was designed from the beginning with 
internetworking in mind. A good way to think of the network layer is this. Its job is to provide 
a best-efforts (i.e., not guaranteed) way to transport datagrams from source to destination, 
without regard to whether these machines are on the same network or whether there are 
other networks in between them. 


Communication in the Internet works as follows. The transport layer takes data streams and 
breaks them up into datagrams. In theory, datagrams can be up to 64 Kbytes each, but in 
practice they are usually not more than 1500 bytes (so they fit in one Ethernet frame). Each 
datagram is transmitted through the Internet, possibly being fragmented into smaller units as 
it goes. When all the pieces finally get to the destination machine, they are reassembled by the 
network layer into the original datagram. This datagram is then handed to the transport layer, 
which inserts it into the receiving process' input stream. As can be seen from Fig. 5-52, a 
packet originating at host 1 has to traverse six networks to get to host 2. In practice, it is 
often much more than six. 


5.3.1 The IP Protocol 


An appropriate place to start our study of the network layer in the Internet is the format of the 
IP datagrams themselves. An IP datagram consists of a header part and a text part. The 
header has a 20-byte fixed part and a variable length optional part. The header format is 
shown in Fig. 5-53. It is transmitted in big-endian order: from left to right, with the high-order 
bit of the Version field going first. (The SPARC is big-endian; the Pentium is little-endian.) On 
little-endian machines, software conversion is required on both transmission and reception. 


Figure 5-53. The IPv4 (Internet Protocol) header. 


The Version field keeps track of which version of the protocol the datagram belongs to. By 
including the version in each datagram, it becomes possible to have the transition between 
versions take years, with some machines running the old version and others running the new 
one. Currently a transition between IPv4 and IPv6 is going on, has already taken years, and is 
by no means close to being finished (Durand, 2001; Wiljakka, 2002; and Waddington and 
Chang, 2002). Some people even think it will never happen (Weiser, 2001). As an aside on 
numbering, IPv5 was an experimental real-time stream protocol that was never widely used. 


Since the header length is not constant, a field in the header, IHL, is provided to tell how long 
the header is, in 32-bit words. The minimum value is 5, which applies when no options are 
present. The maximum value of this 4-bit field is 15, which limits the header to 60 bytes, and 
thus the Options field to 40 bytes. For some options, such as one that records the route a 
packet has taken, 40 bytes is far too small, making that option useless. 
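
The arithmetic on the Version and IHL fields can be sketched as follows. The function name and the returned tuple layout are our own choices; the only facts used are that both fields share the first header byte, with Version in the high-order four bits, and that IHL counts 32-bit words.

```python
def parse_version_ihl(header: bytes):
    """Return (version, ihl, header_bytes, option_bytes) from the first header byte."""
    first = header[0]
    version = first >> 4               # high-order 4 bits
    ihl = first & 0x0F                 # low-order 4 bits, in 32-bit words
    if ihl < 5:
        raise ValueError("IHL is below the minimum of 5 words")
    header_bytes = ihl * 4
    option_bytes = header_bytes - 20   # whatever exceeds the fixed 20-byte part
    return version, ihl, header_bytes, option_bytes
```

A first byte of 0x45 is the common case (IPv4, no options). The largest possible value, 0x4F, gives a 60-byte header with 40 bytes of options.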


The Type of service field is one of the few fields that has changed its meaning (slightly) over 
the years. It was and is still intended to distinguish between different classes of service. 
Various combinations of reliability and speed are possible. For digitized voice, fast delivery 
beats accurate delivery. For file transfer, error-free transmission is more important than fast 
transmission. 


Originally, the first six bits of the field contained (from left to right) a three-bit Precedence field and three 
flags, D, T, and R. The Precedence field was a priority, from 0 (normal) to 7 (network control 
packet). The three flag bits allowed the host to specify what it cared most about from the set 
{Delay, Throughput, Reliability}. In theory, these fields allow routers to make choices 
between, for example, a satellite link with high throughput and high delay or a leased line with 
low throughput and low delay. In practice, current routers often ignore the Type of service field 
altogether. 
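
To make the original layout concrete, here is a sketch that unpacks the Precedence field and the three flags from the Type of service byte. The bit positions follow the original IP specification (RFC 791), in which the two low-order bits were unused; the function name and dictionary keys are ours.

```python
def parse_tos(tos: int) -> dict:
    """Split the original Type of service byte into Precedence and D, T, R flags."""
    return {
        "precedence": (tos >> 5) & 0x7,   # three high-order bits, 0 to 7
        "delay": bool(tos & 0x10),        # D flag
        "throughput": bool(tos & 0x08),   # T flag
        "reliability": bool(tos & 0x04),  # R flag
    }
```

For example, a byte of 0b11100000 carries precedence 7 (network control) with no flags set, while 0b00010100 asks for low delay and high reliability at normal priority.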


Eventually, IETF threw in the towel and changed the field slightly to accommodate 
differentiated services. Six of the bits are used to indicate which of the service classes 
discussed earlier each packet belongs to. These classes include the four queueing priorities, 
three discard probabilities, and the historical classes. 


The Total length includes everything in the datagram—both header and data. The maximum 
length is 65,535 bytes. At present, this upper limit is tolerable, but with future gigabit 
networks, larger datagrams may be needed. 


The Identification field is needed to allow the destination host to determine which datagram a 
newly arrived fragment belongs to. All the fragments of a datagram contain the same 
Identification value. 

Next comes an unused bit and then two 1-bit fields. DF stands for Don't Fragment. It is an 
order to the routers not to fragment the datagram because the destination is incapable of 
putting the pieces back together again. For example, when a computer boots, its ROM might 
ask for a memory image to be sent to it as a single datagram. By marking the datagram with 
the DF bit, the sender knows it will arrive in one piece, even if this means that the datagram 
must avoid a small-packet network on the best path and take a suboptimal route. All machines 
are required to accept fragments of 576 bytes or less. 


MF stands for More Fragments. All fragments except the last one have this bit set. It is needed 
to know when all fragments of a datagram have arrived. 


The Fragment offset tells where in the current datagram this fragment belongs. All fragments 
except the last one in a datagram must be a multiple of 8 bytes, the elementary fragment unit. 
Since 13 bits are provided, there is a maximum of 8192 fragments per datagram, giving a 
maximum datagram length of 65,536 bytes, one more than the Total length field. 
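
The interplay of MF and Fragment offset can be sketched with a small function that splits a datagram's data into fragments. It is an illustration only: headers are ignored, the parameter max_data (the data bytes that fit in one fragment on the outgoing network) is an assumption, and each fragment is described by a tuple rather than built as a packet.

```python
def fragment(data_len: int, max_data: int):
    """Return a list of (offset_in_8_byte_units, size_in_bytes, more_fragments)."""
    step = (max_data // 8) * 8          # all but the last fragment: multiple of 8
    if step == 0:
        raise ValueError("fragments must be able to carry at least 8 data bytes")
    fragments = []
    offset = 0
    while offset < data_len:
        size = min(step, data_len - offset)
        more = offset + size < data_len  # MF is set on all but the last fragment
        fragments.append((offset // 8, size, more))
        offset += size
    return fragments
```

Splitting 4000 data bytes for a link that carries 1480 data bytes per fragment yields offsets 0, 185, and 370 (in 8-byte units), with MF set on the first two fragments only.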


The Time to live field is a counter used to limit packet lifetimes. It is supposed to count time in 
seconds, allowing a maximum lifetime of 255 sec. It must be decremented on each hop and is 
supposed to be decremented multiple times when queued for a long time in a router. In 
practice, it just counts hops. When it hits zero, the packet is discarded and a warning packet is 
sent back to the source host. This feature prevents datagrams from wandering around forever, 
something that otherwise might happen if the routing tables ever become corrupted. 


When the network layer has assembled a complete datagram, it needs to know what to do 
with it. The Protocol field tells it which transport process to give it to. TCP is one possibility, 
but so are UDP and some others. The numbering of protocols is global across the entire 
Internet. Protocols and other assigned numbers were formerly listed in RFC 1700, but 
nowadays they are contained in an on-line data base located at www.iana.org. 


The Header checksum verifies the header only. Such a checksum is useful for detecting errors 
generated by bad memory words inside a router. The algorithm is to add up all the 16-bit 
halfwords as they arrive, using one's complement arithmetic and then take the one's 
complement of the result. For purposes of this algorithm, the Header checksum is assumed to 
be zero upon arrival. This algorithm is more robust than using a normal add. Note that the 
Header checksum must be recomputed at each hop because at least one field always changes 
(the Time to live field), but tricks can be used to speed up the computation. 
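
The checksum algorithm can be written out directly. The sketch below assumes the header is available as a bytes object; the function names are ours, and no attempt is made at the speed-up tricks mentioned above.

```python
def ones_complement_sum16(data: bytes) -> int:
    """Add up 16-bit halfwords using one's complement (end-around carry) arithmetic."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input (headers never need this)
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]       # big-endian halfword
        total = (total & 0xFFFF) + (total >> 16)    # fold the carry back in
    return total


def header_checksum(header: bytes) -> int:
    """Checksum of an IP header, treating the checksum field (bytes 10-11) as zero."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return ~ones_complement_sum16(zeroed) & 0xFFFF


def header_is_valid(header: bytes) -> bool:
    """A correct header, checksum included, sums to all 1s."""
    return ones_complement_sum16(header) == 0xFFFF
```

A router that decrements Time to live would call header_checksum again on the modified header and store the result in bytes 10 and 11 before forwarding.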


The Source address and Destination address indicate the network number and host number. 
We will discuss Internet addresses in the next section. The Options field was designed to 
provide an escape to allow subsequent versions of the protocol to include information not 
present in the original design, to permit experimenters to try out new ideas, and to avoid 
allocating header bits to information that is rarely needed. The options are variable length. 
Each begins with a 1-byte code identifying the option. Some options are followed by a 1-byte 
option length field, and then one or more data bytes. The Options field is padded out to a 
multiple of four bytes. Originally, five options were defined, as listed in Fig. 5-54, but since 
then some new ones have been added. The current complete list is now maintained on-line at 
www.iana.org/assignments/ip-parameters. 


Figure 5-54. Some of the IP options. 


Option                | Description 
Security              | Specifies how secret the datagram is 
Strict source routing | Gives the complete path to be followed 
Loose source routing  | Gives a list of routers not to be missed 
Record route          | Makes each router append its IP address 
Timestamp             | Makes each router append its address and timestamp 


The Security option tells how secret the information is. In theory, a military router might use 
this field to specify not to route through certain countries the military considers to be "bad 
guys." In practice, all routers ignore it, so its only practical function is to help spies find the 
good stuff more easily. 


The Strict source routing option gives the complete path from source to destination as a 
sequence of IP addresses. The datagram is required to follow that exact route. It is most useful 
for system managers to send emergency packets when the routing tables are corrupted, or for 
making timing measurements. 


The Loose source routing option requires the packet to traverse the list of routers specified, 
and in the order specified, but it is allowed to pass through other routers on the way. 
Normally, this option would only provide a few routers, to force a particular path. For example, 
to force a packet from London to Sydney to go west instead of east, this option might specify 
routers in New York, Los Angeles, and Honolulu. This option is most useful when political or 
economic considerations dictate passing through or avoiding certain countries. 


The Record route option tells the routers along the path to append their IP address to the 
option field. This allows system managers to track down bugs in the routing algorithms ("Why 
are packets from Houston to Dallas visiting Tokyo first?"). When the ARPANET was first set up, 
no packet ever passed through more than nine routers, so 40 bytes of option was ample. As 
mentioned above, now it is too small. 


Finally, the Timestamp option is like the Record route option, except that in addition to 
recording its 32-bit IP address, each router also records a 32-bit timestamp. This option, too, 
is mostly for debugging routing algorithms. 


5.3.2 IP Addresses 


Every host and router on the Internet has an IP address, which encodes its network number 
and host number. The combination is unique: in principle, no two machines on the Internet 
have the same IP address. All IP addresses are 32 bits long and are used in the Source 
address and Destination address fields of IP packets. It is important to note that an IP address 
does not actually refer to a host. It really refers to a network interface, so if a host is on two 
networks, it must have two IP addresses. However, in practice, most hosts are on one network 
and thus have one IP address. 


For several decades, IP addresses were divided into the five categories listed in Fig. 5-55. This 
allocation has come to be called classful addressing. It is no longer used, but references to it 
in the literature are still common. We will discuss the replacement of classful addressing 
shortly. 


Figure 5-55. IP address formats. 


Class | Leading bits | Remaining fields        | Range of host addresses 
A     | 0            | Network, Host           | 1.0.0.0 to 127.255.255.255 
B     | 10           | Network, Host           | 128.0.0.0 to 191.255.255.255 
C     | 110          | Network, Host           | 192.0.0.0 to 223.255.255.255 
D     | 1110         | Multicast address       | 224.0.0.0 to 239.255.255.255 
E     | 1111         | Reserved for future use | 240.0.0.0 to 255.255.255.255 

The class A, B, C, and D formats allow for up to 128 networks with 16 million hosts each, 
16,384 networks with up to 64K hosts, and 2 million networks (e.g., LANs) with up to 256 
hosts each (although a few of these are special). Also supported is multicast, in which a 
datagram is directed to multiple hosts. Addresses beginning with 1111 are reserved for future 
use. Over 500,000 networks are now connected to the Internet, and the number grows every 
year. Network numbers are managed by a nonprofit corporation called ICANN (Internet 
Corporation for Assigned Names and Numbers) to avoid conflicts. In turn, ICANN has 
delegated parts of the address space to various regional authorities, which then dole out IP 
addresses to ISPs and other companies. 
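
Because the class is encoded in the leading bits, it can be determined by inspecting at most the top four bits of the address. The sketch below takes the address as a 32-bit integer; the function name is ours.

```python
def address_class(addr: int) -> str:
    """Return the class (A to E) of a 32-bit IP address from its leading bits."""
    top = (addr >> 28) & 0xF            # the four high-order bits
    if top & 0b1000 == 0:
        return "A"                      # 0...
    if top & 0b0100 == 0:
        return "B"                      # 10..
    if top & 0b0010 == 0:
        return "C"                      # 110.
    if top & 0b0001 == 0:
        return "D"                      # 1110
    return "E"                          # 1111


# Field widths behind the counts quoted above.
CLASS_A_NETWORKS = 2 ** 7               # 128
CLASS_B_NETWORKS = 2 ** 14              # 16,384
CLASS_C_NETWORKS = 2 ** 21              # about 2 million
```

The host counts follow the same way: 2^24 (about 16 million) for class A, 2^16 (64K) for class B, and 2^8 (256) for class C.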


Network addresses, which are 32-bit numbers, are usually written in dotted decimal 
notation. In this format, each of the 4 bytes is written in decimal, from 0 to 255. For 
example, the 32-bit hexadecimal address C0290614 is written as 192.41.6.20. The lowest IP 
address is 0.0.0.0 and the highest is 255.255.255.255. 
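
The conversion between the two notations is simple byte arithmetic, sketched here with two helper functions whose names are ours.

```python
def to_dotted(addr: int) -> str:
    """Write a 32-bit address as four decimal bytes, high-order byte first."""
    return ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def from_dotted(text: str) -> int:
    """Parse dotted decimal notation back into a 32-bit integer."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("expected four decimal bytes in the range 0 to 255")
    value = 0
    for p in parts:
        value = (value << 8) | p
    return value
```

Applied to the example above, to_dotted(0xC0290614) gives 192.41.6.20, since C0, 29, 06, and 14 in hexadecimal are 192, 41, 6, and 20 in decimal.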


The values 0 and -1 (all 1s) have special meanings, as shown in Fig. 5-56. The value 0 means 
this network or this host. The value of -1 is used as a broadcast address to mean all hosts on 
the indicated network. 


Figure 5-56. Special IP addresses. 


Address form                               | Meaning 
All 0s                                     | This host 
0s in the network part, followed by a host | A host on this network 
All 1s                                     | Broadcast on the local network 
Network number, followed by all 1s         | Broadcast on a distant network 
127, followed by anything                  | Loopback 


The IP address 0.0.0.0 is used by hosts when they are being booted. IP addresses with 0 as 
network number refer to the current network. These addresses allow machines to refer to their 
own network without knowing its number (but they have to know its class to know how many 
0s to include). The address consisting of all 1s allows broadcasting on the local network, 
typically a LAN. The addresses with a proper network number and all 1s in the host field allow 
machines to send broadcast packets to distant LANs anywhere in the Internet (although many 
network administrators disable this feature). Finally, all addresses of the form 127.xx.yy.zz are 
reserved for loopback testing. Packets sent to that address are not put out onto the wire; they 
are processed locally and treated as incoming packets. This allows packets to be sent to the 
local network without the sender knowing its number. 


Subnets 


As we have seen, all the hosts in a network must have the same network number. This 
property of IP addressing can cause problems as networks grow. For example, consider a 
university that started out with one class B network used by the Computer Science Dept. for 
the computers on its Ethernet. A year later, the Electrical Engineering Dept. wanted to get on 
the Internet, so they bought a repeater to extend the CS Ethernet to their building. As time 
went on, many other departments acquired computers and the limit of four repeaters per 
Ethernet was quickly reached. A different organization was required. 


Getting a second network address would be hard to do since network addresses are scarce and 
the university already had enough addresses for over 60,000 hosts. The problem is the rule 
that a single class A, B, or C address refers to one network, not to a collection of LANs. As 
more and more organizations ran into this situation, a small change was made to the 
addressing system to deal with it. 


The solution is to allow a network to be split into several parts for internal use but still act like 
a single network to the outside world. A typical campus network nowadays might look like that 
of Fig. 5-57, with a main router connected to an ISP or regional network and numerous 
Ethernets spread around campus in different departments. Each of the Ethernets has its own 
router connected to the main router (possibly via a backbone LAN, but the nature of the 
interrouter connection is not relevant here). 


Figure 5-57. A campus network consisting of LANs for various departments. 


In the Internet literature, the parts of the network (in this case, Ethernets) are called 
subnets. As we mentioned in Chap. 1, this usage conflicts with "subnet" to mean the set of all 
routers and communication lines in a network. Hopefully, it will be clear from the context 
which meaning is intended. In this section and the next one, the new definition will be the one 
used exclusively. 


When a packet comes into the main router, how does it know which subnet (Ethernet) to give 
it to? One way would be to have a table with 65,536 entries in the main router telling which 
router to use for each host on campus. This idea would work, but it would require a very large 
table in the main router and a lot of manual maintenance as hosts were added, moved, or 
taken out of service. 


Instead, a different scheme was invented. Basically, instead of having a single class B address 
with 14 bits for the network number and 16 bits for the host number, some bits are taken 
away from the host number to create a subnet number. For example, if the university has 35 
departments, it could use a 6-bit subnet number and a 10-bit host number, allowing for up to 
64 Ethernets, each with a maximum of 1022 hosts (0 and -1 are not available, as mentioned 
earlier). This split could be changed later if it turns out to be the wrong one. 


To implement subnetting, the main router needs a subnet mask that indicates the split 
between network + subnet number and host, as shown in Fig. 5-58. Subnet masks are also 
written in dotted decimal notation, with the addition of a slash followed by the number of bits 
in the network + subnet part. For the example of Fig. 5-58, the subnet mask can be written as 
255.255.252.0. An alternative notation is /22 to indicate that the subnet mask is 22 bits long. 
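
The relation between the two notations, and the host count quoted earlier, can be checked with a few lines of Python. The helper names are ours.

```python
def mask_from_prefix(bits: int) -> int:
    """32-bit mask with the top `bits` bits set, e.g., bits=22 for /22."""
    if not 0 <= bits <= 32:
        raise ValueError("prefix length must be between 0 and 32")
    return (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF


def dotted(addr: int) -> str:
    """Dotted decimal form of a 32-bit value."""
    return ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def usable_hosts(bits: int) -> int:
    """Host numbers per subnet, excluding all 0s and all 1s."""
    return 2 ** (32 - bits) - 2
```

For the /22 split of a class B network, dotted(mask_from_prefix(22)) gives 255.255.252.0, usable_hosts(22) gives 1022, and the 22 - 16 = 6 subnet bits allow 2^6 = 64 subnets.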


Figure 5-58. A class B network subnetted into 64 subnets. 


Outside the network, the subnetting is not visible, so allocating a new subnet does not require 
contacting ICANN or changing any external databases. In this example, the first subnet might 
use IP addresses starting at 130.50.4.1; the second subnet might start at 130.50.8.1; the 
third subnet might start at 130.50.12.1; and so on. To see why the subnets are counting by 
fours, note that the corresponding binary addresses are as follows: 


Subnet 1: 10000010 00110010 000001|00 00000001 
Subnet 2: 10000010 00110010 000010|00 00000001 
Subnet 3: 10000010 00110010 000011|00 00000001 


Here the vertical bar (|) shows the boundary between the subnet number and the host 
number. To its left is the 6-bit subnet number; to its right is the 10-bit host number. 


To see how subnets work, it is necessary to explain how IP packets are processed at a router. 
Each router has a table listing some number of (network, 0) IP addresses and some number of 
(this-network, host) IP addresses. The first kind tells how to get to distant networks. The 
second kind tells how to get to local hosts. Associated with each table entry is the network interface 
to use to reach the destination, and certain other information. 


When an IP packet arrives, its destination address is looked up in the routing table. If the 
packet is for a distant network, it is forwarded to the next router on the interface given in the 
table. If it is a local host (e.g., on the router's LAN), it is sent directly to the destination. If the 
network is not present, the packet is forwarded to a default router with more extensive tables. 
This algorithm means that each router only has to keep track of other networks and local 
hosts, not (network, host) pairs, greatly reducing the size of the routing table. 
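
The table lookup just described can be sketched as follows, for the classful case without subnets. The table contents, interface names, and function names are hypothetical; the table is keyed by (network, 0) for distant networks and (this-network, host) for local hosts, as in the text.

```python
def ip(text: str) -> int:
    """Parse dotted decimal notation into a 32-bit integer."""
    a, b, c, d = (int(part) for part in text.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def split_classful(addr: int):
    """Return (network, host) for a class A, B, or C address."""
    if addr >> 31 == 0b0:
        host_bits = 24                  # class A
    elif addr >> 30 == 0b10:
        host_bits = 16                  # class B
    elif addr >> 29 == 0b110:
        host_bits = 8                   # class C
    else:
        raise ValueError("class D and E addresses have no network/host split")
    host_mask = (1 << host_bits) - 1
    return addr & ~host_mask & 0xFFFFFFFF, addr & host_mask


def next_hop(dest: int, table: dict, default: str) -> str:
    """Look up a local host first, then the distant network, then the default router."""
    network, host = split_classful(dest)
    if (network, host) in table:
        return table[(network, host)]
    if (network, 0) in table:
        return table[(network, 0)]
    return default


# Hypothetical table for a router attached to the class C network 192.41.6.0.
TABLE = {
    (ip("192.41.6.0"), 20): "deliver directly on eth0",
    (ip("130.50.0.0"), 0): "forward via eth1",
}
```

A packet for 130.50.15.6 matches the (network, 0) entry for 130.50.0.0, whereas a packet for a network not in the table falls through to the default router.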


When subnetting is introduced, the routing tables are changed, adding entries of the form 
(this-network, subnet, 0) and (this-network, this-subnet, host). Thus, a router on subnet k 
knows how to get to all the other subnets and also how to get to all the hosts on subnet k. It 
does not have to know the details about hosts on other subnets. In fact, all that needs to be 
changed is to have each router do a Boolean AND with the network's subnet mask to get rid of 
the host number and look up the resulting address in its tables (after determining which 
network class it is). For example, a packet addressed to 130.50.15.6 and arriving at the main 
router is ANDed with the subnet mask 255.255.252.0/22 to give the address 130.50.12.0. This 
address is looked up in the routing tables to find out which output line to use to get to the 
router for subnet 3. Subnetting thus reduces router table space by creating a three-level 
hierarchy consisting of network, subnet, and host. 
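
The masking step in the example can be verified directly. In this sketch the routing table and its output-line names are hypothetical; only the addresses and the mask come from the text.

```python
def ip(text: str) -> int:
    """Parse dotted decimal notation into a 32-bit integer."""
    a, b, c, d = (int(part) for part in text.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def dotted(addr: int) -> str:
    """Dotted decimal form of a 32-bit value."""
    return ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))


MASK = ip("255.255.252.0")              # the /22 subnet mask from the example

# Hypothetical main-router table: (this-network, subnet, 0) -> output line.
SUBNET_ROUTES = {
    ip("130.50.4.0"): "line to subnet 1",
    ip("130.50.8.0"): "line to subnet 2",
    ip("130.50.12.0"): "line to subnet 3",
}


def route(dest: int):
    """AND the destination with the mask and look up the result."""
    return SUBNET_ROUTES.get(dest & MASK)
```

Here ip("130.50.15.6") & MASK equals 130.50.12.0, because 15 is 00001111 in binary and the mask keeps only its six high-order bits, 000011.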


CIDR —Classless InterDomain Routing 


IP has been in heavy use for decades. It has worked extremely well, as demonstrated by the 
exponential growth of the Internet. Unfortunately, IP is rapidly becoming a victim of its own 
popularity: it is running out of addresses. This looming disaster has sparked a great deal of 
discussion and controversy within the Internet community about what to do about it. In this 
section we will describe both the problem and several proposed solutions. 


Back in 1987, a few visionaries predicted that some day the Internet might grow to 100,000 
networks. Most experts pooh-poohed this as being decades in the future, if ever. The 
100,000th network was connected in 1996. The problem, as mentioned above, is that the 
Internet is rapidly running out of IP addresses. In principle, over 2 billion addresses exist, but 
the practice of organizing the address space by classes (see Fig. 5-55) wastes millions of 
them. In particular, the real villain is the class B network. For most organizations, a class A 
network, with 16 million addresses, is too big, and a class C network, with 256 addresses, is too 
small. A class B network, with 65,536, is just right. In Internet folklore, this situation is known 
as the three bears problem (as in Goldilocks and the Three Bears). 


In reality, a class B address is far too large for most organizations. Studies have shown that 
more than half of all class B networks have fewer than 50 hosts. A class C network would have 
done the job, but no doubt every organization that asked for a class B address thought that 
one day it would outgrow the 8-bit host field. In retrospect, it might have been better to have 
had class C networks use 10 bits instead of eight for the host number, allowing 1022 hosts per 
network. Had this been the case, most organizations would have probably settled for a class C 
network, and there would have been half a million of them (versus only 16,384 class B 
networks). 


It is hard to fault the Internet designers for not having provided more (and smaller) class B 
addresses. At the time the decision was made to create the three classes, the Internet was a 
research network connecting the major research universities in the U.S. (plus a very small 
number of companies and military sites doing networking research). No one then perceived the 
Internet as becoming a mass market communication system rivaling the telephone network. At 
the time, someone no doubt said: "The U.S. has about 2000 colleges and universities. Even if 
all of them connect to the Internet and many universities in other countries join, too, we are 
never going to hit 16,000 since there are not that many universities in the whole world. 
Furthermore, having the host number be an integral number of bytes speeds up packet 
processing." 


However, if the split had allocated 20 bits to the class B network number, another problem 
would have emerged: the routing table explosion. From the point of view of the routers, the IP 
address space is a two-level hierarchy, with network numbers and host numbers. Routers do 
not have to know about all the hosts, but they do have to know about all the networks. If half 
a million class C networks were in use, every router in the entire Internet would need a table 
with half a million entries, one per network, telling which line to use to get to that network, as 
well as providing other information. 


The actual physical storage of tables with half a million entries is probably doable, although 
expensive for critical routers that keep the tables in static RAM on I/O boards. A more serious 
problem is that the complexity of various algorithms relating to management of the tables 
grows faster than linear. Worse yet, much of the existing router software and firmware was 
designed at a time when the Internet had 1000 connected networks and 10,000 networks 
seemed decades away. Design choices made then often are far from optimal now. 


In addition, various routing algorithms require each router to transmit its tables periodically 
(e.g., distance vector protocols). The larger the tables, the more likely it is that some parts will 
get lost underway, leading to incomplete data at the other end and possibly routing 
instabilities. 


The routing table problem could have been solved by going to a deeper hierarchy. For 
example, having each IP address contain a country, state/province, city, network, and host 
field might work. Then each router would only need to know how to get to each country, the 
states or provinces in its own country, the cities in its state or province, and the networks in its 
city. Unfortunately, this solution would require considerably more than 32 bits for IP addresses 
and would use addresses inefficiently (Liechtenstein would have as many bits as the United 
States). 


In short, some solutions solve one problem but create a new one. The solution that was 
implemented and that gave the Internet a bit of extra breathing room is CIDR (Classless 
InterDomain Routing). The basic idea behind CIDR, which is described in RFC 1519, is to 
allocate the remaining IP addresses in variable-sized blocks, without regard to the classes. If a 
site needs, say, 2000 addresses, it is given a block of 2048 addresses on a 2048-address 
boundary. 
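
As a rough illustration of how a request is rounded up to a CIDR block, the following sketch computes the block size and the corresponding prefix length. The function name is an assumption of ours; it is not part of RFC 1519.

```python
def cidr_block(addresses_needed):
    """Return (block size, prefix length) of the smallest power-of-two block
    that holds the requested number of addresses."""
    size = 1
    while size < addresses_needed:
        size *= 2
    prefix_length = 32 - (size.bit_length() - 1)   # 32 minus log2(size)
    return size, prefix_length

print(cidr_block(2000))   # (2048, 21)
```

A site asking for 2000 addresses thus receives a /21, whose 11 host bits give 2048 addresses.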


Dropping the classes makes forwarding more complicated. In the old classful system, 
forwarding worked like this. When a packet arrived at a router, a copy of the IP address was 
shifted right 28 bits to yield a 4-bit class number. A 16-way branch then sorted packets into A, 
B, C, and D (if supported), with eight of the cases for class A, four of the cases for class B, two 
of the cases for class C, and one each for D and E. The code for each class then masked off the 
8-, 16-, or 24-bit network number and right aligned it in a 32-bit word. The network number 
was then looked up in the A, B, or C table, usually by indexing for A and B networks and 
hashing for C networks. Once the entry was found, the outgoing line could be looked up and 
the packet forwarded. 
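
The classful dispatch just described can be sketched in a few lines. This is only an illustration under our own naming; a real router would follow the classification with a lookup in the per-class table, which is omitted here.

```python
import ipaddress

def classful_network(addr):
    """Classify a 32-bit IPv4 address and return (class, right-aligned network number)."""
    selector = addr >> 28              # top 4 bits: the 16-way branch
    if selector < 8:                   # 0xxx: eight cases, class A
        return "A", addr >> 24         # 8-bit network number
    if selector < 12:                  # 10xx: four cases, class B
        return "B", addr >> 16         # 16-bit network number
    if selector < 14:                  # 110x: two cases, class C
        return "C", addr >> 8          # 24-bit network number
    if selector == 14:                 # 1110: class D
        return "D", None
    return "E", None                   # 1111: class E

addr = int(ipaddress.ip_address("194.24.17.4"))
print(classful_network(addr)[0])       # C
```

Because the class is known from the first four bits alone, no mask needs to be stored in the routing table, which is exactly what CIDR gives up.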

With CIDR, this simple algorithm no longer works. Instead, each routing table entry is 
extended by giving it a 32-bit mask. Thus, there is now a single routing table for all networks 
consisting of an array of (IP address, subnet mask, outgoing line) triples. When a packet 
comes in, its destination IP address is first extracted. Then (conceptually) the routing table is 
scanned entry by entry, masking the destination address and comparing it to the table entry 
looking for a match. It is possible that multiple entries (with different subnet mask lengths) 
match, in which case the longest mask is used. Thus, if there is a match for a /20 mask and a 
/24 mask, the /24 entry is used. 
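
The conceptual scan can be written down directly. The sketch below is a linear search, not the optimized structure a production router would use, and the table contents and line names are hypothetical.

```python
import ipaddress

def to_int(dotted):
    return int(ipaddress.ip_address(dotted))

def lookup(table, destination):
    """Scan (base, mask, line) triples and return the line of the longest match."""
    dest = to_int(destination)
    best = None
    for base, mask, line in table:
        # For contiguous masks, a longer mask is also numerically larger.
        if dest & mask == base and (best is None or mask > best[0]):
            best = (mask, line)
    return None if best is None else best[1]

# Hypothetical table: a /20 entry and a more specific /24 entry inside it.
table = [
    (to_int("10.1.0.0"), to_int("255.255.240.0"), "line 1"),   # 10.1.0.0/20
    (to_int("10.1.4.0"), to_int("255.255.255.0"), "line 2"),   # 10.1.4.0/24
]
print(lookup(table, "10.1.4.9"))   # line 2 (both match, /24 wins)
print(lookup(table, "10.1.5.9"))   # line 1 (only the /20 matches)
```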


Complex algorithms have been devised to speed up the address matching process (Ruiz- 
Sanchez et al., 2001). Commercial routers use custom VLSI chips with these algorithms 
embedded in hardware. 


To make the forwarding algorithm easier to understand, let us consider an example in which 
millions of addresses are available starting at 194.24.0.0. Suppose that Cambridge University 
needs 2048 addresses and is assigned the addresses 194.24.0.0 through 194.24.7.255, along 
with mask 255.255.248.0. Next, Oxford University asks for 4096 addresses. Since a block of 
4096 addresses must lie on a 4096-address boundary, they cannot be given addresses starting at 
194.24.8.0. Instead, they get 194.24.16.0 through 194.24.31.255 along with subnet mask 
255.255.240.0. Now the University of Edinburgh asks for 1024 addresses and is assigned 
addresses 194.24.8.0 through 194.24.11.255 and mask 255.255.252.0. These assignments 
are summarized in Fig. 5-59. 
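
The alignment rule explains why Oxford skips over 194.24.8.0. A minimal sketch of a lowest-address-first allocator reproduces the three assignments; it assumes every request is a power of two, and the function is our own illustration rather than a description of how address registries actually work.

```python
import ipaddress

def allocate(pool_start, requests):
    """Give each (name, size) request the lowest free block aligned on its own size."""
    base = int(ipaddress.ip_address(pool_start))
    taken = []                                   # (first, last) integer ranges in use
    result = {}
    for name, size in requests:
        start = -(-base // size) * size          # round up to a multiple of size
        while any(start <= last and first <= start + size - 1
                  for first, last in taken):
            start += size                        # try the next aligned block
        taken.append((start, start + size - 1))
        prefix_length = 32 - (size.bit_length() - 1)
        result[name] = f"{ipaddress.ip_address(start)}/{prefix_length}"
    return result

print(allocate("194.24.0.0",
               [("Cambridge", 2048), ("Oxford", 4096), ("Edinburgh", 1024)]))
```

Oxford's first candidate, 194.24.0.0, collides with Cambridge, and the next address that is a multiple of 4096 is 194.24.16.0. Edinburgh's smaller block then fits into the gap at 194.24.8.0.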


Figure 5-59. A set of IP address assignments. 
University     First address    Last address      How many    Written as
Cambridge      194.24.0.0       194.24.7.255      2048        194.24.0.0/21
Edinburgh      194.24.8.0       194.24.11.255     1024        194.24.8.0/22
(Available)    194.24.12.0      194.24.15.255     1024        194.24.12.0/22
Oxford         194.24.16.0      194.24.31.255     4096        194.24.16.0/20


The routing tables all over the world are now updated with the three assigned entries. Each 
entry contains a base address and a subnet mask. These entries (in binary) are: 


            Address                                Mask

Cambridge   11000010 00011000 00000000 00000000    11111111 11111111 11111000 00000000
Edinburgh   11000010 00011000 00001000 00000000    11111111 11111111 11111100 00000000
Oxford      11000010 00011000 00010000 00000000    11111111 11111111 11110000 00000000


Now consider what happens when a packet comes in addressed to 194.24.17.4, which in 
binary is represented as the following 32-bit string: 


11000010 00011000 00010001 00000100 


First it is Boolean ANDed with the Cambridge mask to get 


11000010 00011000 00010000 00000000 


This value does not match the Cambridge base address, so the original address is next ANDed 
with the Edinburgh mask to get 


11000010 00011000 00010000 00000000 


This value does not match the Edinburgh base address, so Oxford is tried next, yielding 


11000010 00011000 00010000 00000000 


This value does match the Oxford base. If no longer matches are found farther down the table, 
the Oxford entry is used and the packet is sent along the line named in it. 
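
The three AND operations of this example are easy to replay. The following sketch uses Python's standard ipaddress module; the helper names are ours.

```python
import ipaddress

ENTRIES = [
    ("Cambridge", "194.24.0.0", "255.255.248.0"),
    ("Edinburgh", "194.24.8.0", "255.255.252.0"),
    ("Oxford", "194.24.16.0", "255.255.240.0"),
]

def masked(destination, mask):
    """Boolean AND of a destination address with a subnet mask."""
    value = int(ipaddress.ip_address(destination)) & int(ipaddress.ip_address(mask))
    return ipaddress.ip_address(value)

def as_bits(address):
    """Format an address as four space-separated octets in binary."""
    bits = format(int(ipaddress.ip_address(address)), "032b")
    return " ".join(bits[i:i + 8] for i in range(0, 32, 8))

def matching_entries(destination):
    return [name for name, base, mask in ENTRIES
            if str(masked(destination, mask)) == base]

for name, base, mask in ENTRIES:
    print(name, as_bits(masked("194.24.17.4", mask)))
print(matching_entries("194.24.17.4"))   # ['Oxford']
```

All three masks happen to yield the same masked value, 194.24.16.0, but only the Oxford entry has that value as its base address.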


Now let us look at these three universities from the point of view of a router in Omaha, 
Nebraska, that has only four outgoing lines: Minneapolis, New York, Dallas, and Denver. When 
the router software there gets the three new entries, it notices that it can combine all three 
entries into a single aggregate entry 194.24.0.0/19 with a binary address and subnet mask as 
follows: 


11000010 00011000 00000000 00000000 11111111 11111111 11100000 00000000 


This entry sends all packets destined for any of the three universities to New York. By 
aggregating the three entries, the Omaha router has reduced its table size by two entries. 
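
One way to find such an aggregate is to shorten a prefix until it covers every block. The sketch below does only that; it assumes, as in the Omaha example, that all the covered entries use the same outgoing line, and the function name is our own.

```python
import ipaddress

def covering_supernet(blocks):
    """Return the longest prefix that contains every block in the list."""
    nets = [ipaddress.ip_network(b) for b in blocks]
    candidate = nets[0]
    while not all(n.subnet_of(candidate) for n in nets):
        candidate = candidate.supernet()     # drop one bit from the prefix
    return candidate

aggregate = covering_supernet(["194.24.0.0/21", "194.24.8.0/22", "194.24.16.0/20"])
print(aggregate)                                                   # 194.24.0.0/19
print(ipaddress.ip_network("194.24.12.0/22").subnet_of(aggregate))  # True
```

The second line shows that the aggregate also covers the block 194.24.12.0/22, which none of the three universities was given.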


If New York has a single line to London for all U.K. traffic, it can use an aggregated entry as 
well. However, if it has separate lines for London and Edinburgh, then it has to have three 
separate entries. Aggregation is heavily used throughout the Internet to reduce the size of the 
router tables. 


As a final note on this example, the aggregate route entry in Omaha also sends packets for the 
unassigned addresses to New York. As long as the addresses are truly unassigned, this does 
not matter because they are not supposed to occur. However, if they are later assigned to a 
company in California, an additional entry, 194.24.12.0/22, will be needed to deal with them. 


NAT—Network Address Translation 


IP addresses are scarce. An ISP might have a /16 (formerly class B) address, giving it 65,534 
host numbers. If it has more customers than that, it has a problem. For home customers with 
dial-up connections, one way around the problem is to dynamically assign an IP address to a 
computer when it calls up and logs in and take the IP address back when the session ends. In 
this way, a single /16 address can handle up to 65,534 active users, which is probably good 
enough for an ISP with several hundred thousand customers. When the session is terminated, 
the IP address is reassigned to another caller. While this strategy works well for an ISP with a 
moderate number of home users, it fails for ISPs that primarily serve business customers. 


The problem is that business customers expect to be on-line continuously during business 
hours. Both small businesses, such as three-person travel agencies, and large corporations 
have multiple computers connected by a LAN. Some computers are employee PCs; others may 
be Web servers. Generally, there is a router on the LAN that is connected to the ISP by a 
leased line to provide continuous connectivity. This arrangement means that each computer 
must have its own IP address all day long. In effect, the total number of computers owned by 
all its business customers combined cannot exceed the number of IP addresses the ISP has. 
For a /16 address, this limits the total number of computers to 65,534. For an ISP with tens of 
thousands of business customers, this limit will quickly be exceeded. 


To make matters worse, more and more home users are subscribing to ADSL or Internet over 
cable. Two of the features of these services are (1) the user gets a permanent IP address and 
(2) there is no connect charge (just a monthly flat rate charge), so many ADSL and cable 
users just stay logged in permanently. This development just adds to the shortage of IP 
addresses. Assigning IP addresses on-the-fly as is done with dial-up users is of no use because 
the number of IP addresses in use at any one instant may be many times the number the ISP 
owns. 


And just to make it a bit more complicated, many ADSL and cable users have two or more 
computers at home, often one for each family member, and they all want to be on-line all the 
time using the single IP address their ISP has given them. The solution here is to connect all 
the PCs via a LAN and put a router on it. From the ISP's point of view, the family is now the 
same as a small business with a handful of computers. Welcome to Jones, Inc. 


The problem of running out of IP addresses is not a theoretical problem that might occur at 
some point in the distant future. It is happening right here and right now. The long-term 
solution is for the whole Internet to migrate to IPv6, which has 128-bit addresses. This 
transition is slowly occurring, but it will be years before the process is complete. As a 
consequence, some people felt that a quick fix was needed for the short term. This quick fix 
came in the form of NAT (Network Address Translation), which is described in RFC 3022 
and which we will summarize below. For additional information, see (Dutcher, 2001). 


The basic idea behind NAT is to assign each company a single IP address (or at most, a small 
number of them) for Internet traffic. Within the company, every computer gets a unique IP 
address, which is used for routing intramural traffic. However, when a packet exits the 
company and goes to the ISP, an address translation takes place. To make this scheme 
possible, three ranges of IP addresses have been declared as private. Companies may use 
them internally as they wish. The only rule is that no packets containing these addresses may 
appear on the Internet itself. The three reserved ranges are: 


10.0.0.0    – 10.255.255.255/8    (16,777,216 hosts) 
172.16.0.0  – 172.31.255.255/12   (1,048,576 hosts) 
192.168.0.0 – 192.168.255.255/16  (65,536 hosts) 


The first range provides for 16,777,216 addresses (except for 0 and -1, as usual) and is the 
usual choice of most companies, even if they do not need so many addresses. 
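
Whether an address falls in one of the three reserved ranges is a simple membership test. The sketch below lists the ranges explicitly rather than relying on a library's broader notion of "private"; the names are ours.

```python
import ipaddress

PRIVATE_RANGES = [ipaddress.ip_network(n)
                  for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_private(address):
    """True if the address lies in one of the three reserved private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in PRIVATE_RANGES)

print([net.num_addresses for net in PRIVATE_RANGES])   # [16777216, 1048576, 65536]
print(is_private("10.0.0.1"), is_private("198.60.42.12"))   # True False
```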


The operation of NAT is shown in Fig. 5-60. Within the company premises, every machine has 
a unique address of the form 10.x.y.z. However, when a packet leaves the company premises, 
it passes through a NAT box that converts the internal IP source address, 10.0.0.1 in the 
figure, to the company's true IP address, 198.60.42.12 in this example. The NAT box is often 
combined in a single device with a firewall, which provides security by carefully controlling 
what goes into the company and what comes out. We will study firewalls in Chap. 8. It is also 
possible to integrate the NAT box into the company's router. 


Figure 5-60. Placement and operation of a NAT box. 


So far we have glossed over one tiny little detail: when the reply comes back (e.g., from a 
Web server), it is naturally addressed to 198.60.42.12, so how does the NAT box know which 
address to replace it with? Herein lies the problem with NAT. If there were a spare field in the 
IP header, that field could be used to keep track of who the real sender was, but only 1 bit is 
still unused. In principle, a new option could be created to hold the true source address, but 
doing so would require changing the IP code on all the machines on the entire Internet to 
handle the new option. This is not a promising alternative for a quick fix. 


What actually happened is as follows. The NAT designers observed that most IP packets carry 
either TCP or UDP payloads. When we study TCP and UDP in Chap. 6, we will see that both of 
these have headers containing a source port and a destination port. Below we will just discuss 
TCP ports, but exactly the same story holds for UDP ports. The ports are 16-bit integers that 
indicate where the TCP connection begins and ends. These ports provide the field needed to 
make NAT work. 


When a process wants to establish a TCP connection with a remote process, it attaches itself to 
an unused TCP port on its own machine. This is called the source port and tells the TCP code 
where to send incoming packets belonging to this connection. The process also supplies a 
destination port to tell who to give the packets to on the remote side. Ports 0-1023 are 
reserved for well-known services. For example, port 80 is the port used by Web servers, so 
remote clients can locate them. Each outgoing TCP message contains both a source port and a 
destination port. Together, these ports serve to identify the processes using the connection on 
both ends. 


An analogy may make the use of ports clearer. Imagine a company with a single main 
telephone number. When people call the main number, they reach an operator who asks which 
extension they want and then puts them through to that extension. The main number is 
analogous to the company's IP address and the extensions on both ends are analogous to the 
ports. Ports are an extra 16 bits of addressing that identify which process gets which incoming 
packet. 


Using the Source port field, we can solve our mapping problem. Whenever an outgoing packet 
enters the NAT box, the 10.x.y.z source address is replaced by the company's true IP address. 
In addition, the TCP Source port field is replaced by an index into the NAT box's 65,536-entry 
translation table. This table entry contains the original IP address and the original source port. 
Finally, both the IP and TCP header checksums are recomputed and inserted into the packet. It 
is necessary to replace the Source port because connections from machines 10.0.0.1 and 
10.0.0.2 may both happen to use port 5000, for example, so the Source port alone is not 
enough to identify the sending process. 


When a packet arrives at the NAT box from the ISP, the Destination port in the TCP header is 
extracted and used as an index into the NAT box's mapping table. From the entry located, the 
internal IP address and original TCP Source port are extracted and inserted into the packet as 
its destination address and Destination port. 
Then both the IP and TCP checksums are recomputed and inserted into the packet. The packet 
is then passed to the company router for normal delivery using the 10.x.y.z address. 
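
The two translation steps can be modeled with a small table. This is a minimal sketch under several assumptions of ours: packets are reduced to (source IP, source port, destination IP, destination port) tuples, checksum recomputation is omitted, table indices start at 4096, and an existing mapping is reused when the same internal endpoint sends again. Note that on a reply the table index arrives in the Destination port field, because the remote host answers to the port it saw as the source.

```python
class NatBox:
    """Toy model of a NAT box's translation table."""

    def __init__(self, public_ip, first_index=4096):
        self.public_ip = public_ip
        self.next_index = first_index
        self.table = {}      # index -> (internal IP, internal source port)
        self.reverse = {}    # (internal IP, internal source port) -> index

    def outgoing(self, src_ip, src_port, dst_ip, dst_port):
        key = (src_ip, src_port)
        if key not in self.reverse:
            if self.next_index > 65535:
                raise RuntimeError("translation table full")
            self.reverse[key] = self.next_index
            self.table[self.next_index] = key
            self.next_index += 1
        # Source address and Source port are both rewritten.
        return (self.public_ip, self.reverse[key], dst_ip, dst_port)

    def incoming(self, src_ip, src_port, dst_ip, dst_port):
        # The Destination port of the reply is the index into the table.
        internal_ip, internal_port = self.table[dst_port]
        return (src_ip, src_port, internal_ip, internal_port)

nat = NatBox("198.60.42.12")
print(nat.outgoing("10.0.0.1", 5000, "192.0.2.80", 80))
print(nat.outgoing("10.0.0.2", 5000, "192.0.2.80", 80))
print(nat.incoming("192.0.2.80", 80, "198.60.42.12", 4097))
```

The two internal machines both use source port 5000, yet they receive distinct table indices, so the reply addressed to index 4097 is delivered to 10.0.0.2.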


NAT can also be used to alleviate the IP shortage for ADSL and cable users. When the ISP 
assigns each user an address, it uses 10.x.y.z addresses. When packets from user machines 
exit the ISP and enter the main Internet, they pass through a NAT box that translates them to 
the ISP's true Internet address. On the way back, packets undergo the reverse mapping. In 
this respect, to the rest of the Internet, the ISP and its home ADSL/cable users just look like 
a big company. 


Although this scheme sort of solves the problem, many people in the IP community regard it 
as an abomination-on-the-face-of-the-earth. Briefly summarized, here are some of the 
objections. First, NAT violates the architectural model of IP, which states that every IP address 
uniquely identifies a single machine worldwide. The whole software structure of the Internet is 
built on this fact. With NAT, thousands of machines may (and do) use address 10.0.0.1. 


Second, NAT changes the Internet from a connectionless network to a kind of connection- 
oriented network. The problem is that the NAT box must maintain information (the mapping) 
for each connection passing through it. Having the network maintain connection state is a 
property of connection-oriented networks, not connectionless ones. If the NAT box crashes and 
its mapping table is lost, all its TCP connections are destroyed. In the absence of NAT, router 
crashes have no effect on TCP. The sending process just times out within a few seconds and 
retransmits all unacknowledged packets. With NAT, the Internet becomes as vulnerable as a 
circuit-switched network. 


Third, NAT violates the most fundamental rule of protocol layering: layer k may not make any 
assumptions about what layer k + 1 has put into the payload field. This basic principle is there 
to keep the layers independent. If TCP is later upgraded to TCP-2, with a different header 
layout (e.g., 32-bit ports), NAT will fail. The whole idea of layered protocols is to ensure that 
changes in one layer do not require changes in other layers. NAT destroys this independence. 


Fourth, processes on the Internet are not required to use TCP or UDP. If a user on machine A 
decides to use some new transport protocol to talk to a user on machine B (for example, for a 
multimedia application), introduction of a NAT box will cause the application to fail because the 
NAT box will not be able to locate the TCP Source port correctly. 


Fifth, some applications insert IP addresses in the body of the text. The receiver then extracts 
these addresses and uses them. Since NAT knows nothing about these addresses, it cannot 
replace them, so any attempt to use them on the remote side will fail. FTP, the standard File 
Transfer Protocol, works this way and can fail in the presence of NAT unless special 
precautions are taken. Similarly, the H.323 Internet telephony protocol (which we will study in 
Chap. 7) has this property and can fail in the presence of NAT. It may be possible to patch NAT 
to work with H.323, but having to patch the code in the NAT box every time a new application 
comes along is not a good idea. 


Sixth, since the TCP Source port field is 16 bits, at most 65,536 machines can be mapped onto 
an IP address. Actually, the number is slightly less because the first 4096 ports are reserved 
for special uses. However, if multiple IP addresses are available, each one can handle up to 
61,440 machines. 


These and other problems with NAT are discussed in RFC 2993. In general, the opponents of 
NAT say that by fixing the problem of insufficient IP addresses with a temporary and ugly 
hack, the pressure to implement the real solution, that is, the transition to IPv6, is reduced, 
and this is a bad thing. 


5.3.3 Internet Control Protocols 


In addition to IP, which is used for data transfer, the Internet has several control protocols 
used in the network layer, including ICMP, ARP, RARP, BOOTP, and DHCP. In this section we 
will look at each of these in turn. 


The Internet Control Message Protocol 


The operation of the Internet is monitored closely by the routers. When something unexpected 
occurs, the event is reported by the ICMP (Internet Control Message Protocol), which is 
also used to test the Internet. About a dozen types of ICMP messages are defined. The most 
important ones are listed in Fig. 5-61. Each ICMP message type is encapsulated in an IP 
packet. 


Figure 5-61. The principal ICMP message types. 


Message type               Description
Destination unreachable    Packet could not be delivered
Time exceeded              Time to live field hit 0
Parameter problem          Invalid header field
Source quench              Choke packet
Redirect                   Teach a router about geography
Echo                       Ask a machine if it is alive
Echo reply                 Yes, I am alive
Timestamp request          Same as Echo request, but with timestamp
Timestamp reply            Same as Echo reply, but with timestamp


The DESTINATION UNREACHABLE message is used when the subnet or a router cannot locate 
the destination or when a packet with the DF bit cannot be delivered because a "small-packet" 
network stands in the way. 


The TIME EXCEEDED message is sent when a packet is dropped because its counter has 
reached zero. This event is a symptom that packets are looping, that there is enormous 
congestion, or that the timer values are being set too low. 


The PARAMETER PROBLEM message indicates that an illegal value has been detected in a 
header field. This problem indicates a bug in the sending host's IP software or possibly in the 
software of a router transited. 


The SOURCE QUENCH message was formerly used to throttle hosts that were sending too 
many packets. When a host received this message, it was expected to slow down. It is rarely 
used any more because when congestion occurs, these packets tend to add more fuel to the 
fire. Congestion control in the Internet is now done largely in the transport layer; we will study 
it in detail in Chap. 6. 


The REDIRECT message is used when a router notices that a packet seems to be routed 
wrong. It is used by the router to tell the sending host about the probable error. 


The ECHO and ECHO REPLY messages are used to see if a given destination is reachable and 
alive. Upon receiving the ECHO message, the destination is expected to send an ECHO REPLY 
message back. The TIMESTAMP REQUEST and TIMESTAMP REPLY messages are similar, except 
that the arrival time of the message and the departure time of the reply are recorded in the 
reply. This facility is used to measure network performance. 
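
To give a feel for what an ECHO message looks like on the wire, the sketch below builds one and verifies its checksum. The field layout (type 8, code 0, checksum, identifier, sequence number) and the one's-complement Internet checksum come from RFC 792, not from the discussion above; the function names are ours, and nothing is actually sent.

```python
import struct

def internet_checksum(data):
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo(identifier, sequence, payload=b""):
    """Build an ICMP ECHO message (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload

message = build_echo(identifier=1, sequence=1, payload=b"ping")
print(internet_checksum(message))             # 0 for a correctly checksummed message
```

A receiver that computes the same checksum over the whole message gets zero if nothing was corrupted in transit.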


In addition to these messages, others have been defined. The on-line list is now kept at 
www.iana.org/assignments/icmp-parameters. 


ARP—The Address Resolution Protocol 


Although every machine on the Internet has one (or more) IP addresses, these cannot actually 
be used for sending packets because the data link layer hardware does not understand 
Internet addresses. Nowadays, most hosts at companies and universities are attached to a 
LAN by an interface board that only understands LAN addresses. For example, every Ethernet 
board ever manufactured comes equipped with a 48-bit Ethernet address. Manufacturers of 
Ethernet boards request a block of addresses from a central authority to ensure that no two 
boards have the same address (to avoid conflicts should the two boards ever appear on the 
same LAN). The boards send and receive frames based on 48-bit Ethernet addresses. They 
know nothing at all about 32-bit IP addresses. 


The question now arises: How do IP addresses get mapped onto data link layer addresses, 
such as Ethernet? To explain how this works, let us use the example of Fig. 5-62, in which a 
small university with several class C (now called /24) networks is illustrated. Here we have two 
Ethernets, one in the Computer Science Dept., with IP address 192.31.65.0 and one in 
Electrical Engineering, with IP address 192.31.63.0. These are connected by a campus 
backbone ring (e.g., FDDI) with IP address 192.31.60.0. Each machine on an Ethernet has a 
unique Ethernet address, labeled E1 through E6, and each machine on the FDDI ring has an 
FDDI address, labeled F1 through F3. 
Figure 5-62. Three interconnected /24 networks: two Ethernets and an FDDI ring. 


Let us start out by seeing how a user on host 1 sends a packet to a user on host 2. Let us 
assume the sender knows the name of the intended receiver, possibly something like 
mary@eagle.cs.uni.edu. The first step is to find the IP address for host 2, known as 
eagle.cs.uni.edu. This lookup is performed by the Domain Name System, which we will study 
in Chap. 7. For the moment, we will just assume that DNS returns the IP address for host 2 
(192.31.65.5). 


The upper layer software on host 1 now builds a packet with 192.31.65.5 in the Destination 
address field and gives it to the IP software to transmit. The IP software can look at the 
address and see that the destination is on its own network, but it needs some way to find the 
destination's Ethernet address. One solution is to have a configuration file somewhere in the 
system that maps IP addresses onto Ethernet addresses. While this solution is certainly 
possible, for organizations with thousands of machines, keeping all these files up to date is an 
error-prone, time-consuming job. 


A better solution is for host 1 to output a broadcast packet onto the Ethernet asking: Who 
owns IP address 192.31.65.5? The broadcast will arrive at every machine on Ethernet 
192.31.65.0, and each one will check its IP address. Host 2 alone will respond with its Ethernet 
address (E2). In this way host 1 learns that IP address 192.31.65.5 is on the host with 
Ethernet address E2. The protocol used for asking this question and getting the reply is called 
ARP (Address Resolution Protocol). Almost every machine on the Internet runs it. ARP is 
defined in RFC 826. 


The advantage of using ARP over configuration files is its simplicity. The system manager 
does not have to do much except assign each machine an IP address and decide about subnet 
masks. ARP does the rest. 


At this point, the IP software on host 1 builds an Ethernet frame addressed to E2, puts the IP 
packet (addressed to 192.31.65.5) in the payload field, and dumps it onto the Ethernet. The 
Ethernet board of host 2 detects this frame, recognizes it as a frame for itself, scoops it up, 
and causes an interrupt. The Ethernet driver extracts the IP packet from the payload and 
passes it to the IP software, which sees that it is correctly addressed and processes it. 


Various optimizations are possible to make ARP work more efficiently. To start with, once a 
machine has run ARP, it caches the result in case it needs to contact the same machine 
shortly. Next time it will find the mapping in its own cache, thus eliminating the need for a 
second broadcast. In many cases host 2 will need to send back a reply, forcing it, too, to run 
ARP to determine the sender's Ethernet address. This ARP broadcast can be avoided by having 
host 1 include its IP-to-Ethernet mapping in the ARP packet. When the ARP broadcast arrives 
at host 2, the pair (192.31.65.7, E1) is entered into host 2's ARP cache for future use. In fact, 
all machines on the Ethernet can enter this mapping into their ARP caches. 


Yet another optimization is to have every machine broadcast its mapping when it boots. This 
broadcast is generally done in the form of an ARP looking for its own IP address. There should 
not be a response, but a side effect of the broadcast is to make an entry in everyone's ARP 
cache. If a response does (unexpectedly) arrive, two machines have been assigned the same 
IP address. The new one should inform the system manager and not boot. 


To allow mappings to change, for example, when an Ethernet board breaks and is replaced 
with a new one (and thus a new Ethernet address), entries in the ARP cache should time out 
after a few minutes. 
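
The caching and timeout behavior can be sketched as follows. The class, its method names, and the 300-second default are assumptions chosen for illustration; the clock is passed in so that the expiry logic can be exercised without waiting.

```python
import time

class ArpCache:
    """Toy ARP cache whose entries expire after a fixed time."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}                     # IP address -> (Ethernet address, time learned)

    def learn(self, ip, ethernet_address):
        """Record a mapping seen in an ARP request or reply."""
        self.entries[ip] = (ethernet_address, self.clock())

    def lookup(self, ip):
        """Return the cached Ethernet address, or None if absent or expired."""
        entry = self.entries.get(ip)
        if entry is None:
            return None
        ethernet_address, learned = entry
        if self.clock() - learned > self.ttl:
            del self.entries[ip]              # stale: force a fresh ARP broadcast
            return None
        return ethernet_address

cache = ArpCache()
cache.learn("192.31.65.7", "E1")              # host 2 learns host 1 from its request
print(cache.lookup("192.31.65.7"))            # E1
```

A return value of None corresponds to the case in which the host must broadcast a new ARP request.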


Now let us look at Fig. 5-62 again, only this time host 1 wants to send a packet to host 4 
(192.31.63.8). Using ARP will fail because host 4 will not see the broadcast (routers do not 
forward Ethernet-level broadcasts). There are two solutions. First, the CS router could be 
configured to respond to ARP requests for network 192.31.63.0 (and possibly other local 
networks). In this case, host 1 will make an ARP cache entry of (192.31.63.8, E3) and happily 
send all traffic for host 4 to the local router. This solution is called proxy ARP. The second 
solution is to have host 1 immediately see that the destination is on a remote network and just 
send all such traffic to a default Ethernet address that handles all remote traffic, in this case 
E3. This solution does not require having the CS router know which remote networks it is 
serving. 
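
The second solution amounts to a test against the host's own subnet mask. In the sketch below, the function name and the dictionary standing in for the ARP cache are our own illustrative choices.

```python
import ipaddress

def next_hop_ethernet(own_interface, destination, arp_cache, default_router):
    """Pick the Ethernet address to put in the frame for a given destination IP."""
    network = ipaddress.ip_interface(own_interface).network
    if ipaddress.ip_address(destination) in network:
        return arp_cache[destination]         # local: deliver directly
    return default_router                     # remote: hand it to the router

arp_cache = {"192.31.65.5": "E2"}
print(next_hop_ethernet("192.31.65.7/24", "192.31.65.5", arp_cache, "E3"))   # E2
print(next_hop_ethernet("192.31.65.7/24", "192.31.63.8", arp_cache, "E3"))   # E3
```

Host 1 never needs to run ARP for 192.31.63.8 itself; it only needs the Ethernet address of the router.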


Either way, what happens is that host 1 packs the IP packet into the payload field of an 
Ethernet frame addressed to E3. When the CS router gets the Ethernet frame, it removes the 
IP packet from the payload field and looks up the IP address in its routing tables. It discovers 
that packets for network 192.31.63.0 are supposed to go to router 192.31.60.7. If it does not 
already know the FDDI address of 192.31.60.7, it broadcasts an ARP packet onto the ring and 
learns that its ring address is F3. It then inserts the packet into the payload field of an FDDI 
frame addressed to F3 and puts it on the ring. 


At the EE router, the FDDI driver removes the packet from the payload field and gives it to the 
IP software, which sees that it needs to send the packet to 192.31.63.8. If this IP address is 
not in its ARP cache, it broadcasts an ARP request on the EE Ethernet and learns that the 
destination address is E6, so it builds an Ethernet frame addressed to E6, puts the packet in the 
payload field, and sends it over the Ethernet. When the Ethernet frame arrives at host 4, the 
packet is extracted from the frame and passed to the IP software for processing. 


Going from host 1 to a distant network over a WAN works essentially the same way, except 
that this time the CS router's tables tell it to use the WAN router whose FDDI address is F2. 


RARP, BOOTP, and DHCP 


ARP solves the problem of finding out which Ethernet address corresponds to a given IP 
address. Sometimes the reverse problem has to be solved: Given an Ethernet address, what is 
the corresponding IP address? In particular, this problem occurs when a diskless workstation is 
booted. Such a machine will normally get the binary image of its operating system from a 
remote file server. But how does it learn its IP address? 


The first solution devised was to use RARP (Reverse Address Resolution Protocol) 
(defined in RFC 903). This protocol allows a newly-booted workstation to broadcast its 
Ethernet address and say: My 48-bit Ethernet address is 14.04.05.18.01.25. Does anyone out 
there know my IP address? The RARP server sees this request, looks up the Ethernet address 
in its configuration files, and sends back the corresponding IP address. 


Using RARP is better than embedding an IP address in the memory image because it allows the 
same image to be used on all machines. If the IP address were buried inside the image, each 
workstation would need its own image. 


A disadvantage of RARP is that it uses a destination address of all 1s (limited broadcasting) to 
reach the RARP server. However, such broadcasts are not forwarded by routers, so a RARP 
server is needed on each network. To get around this problem, an alternative bootstrap 
protocol called BOOTP was invented. Unlike RARP, BOOTP uses UDP messages, which are 
forwarded over routers. It also provides a diskless workstation with additional information, 
including the IP address of the file server holding the memory image, the IP address of the 
default router, and the subnet mask to use. BOOTP is described in RFCs 951, 1048, and 1084. 


A serious problem with BOOTP is that it requires manual configuration of tables mapping IP 
addresses to Ethernet addresses. When a new host is added to a LAN, it cannot use BOOTP until an 
administrator has assigned it an IP address and entered its (Ethernet address, IP address) pair into 
the BOOTP configuration tables by hand. To eliminate this error-prone step, BOOTP was 
extended and given a new name: DHCP (Dynamic Host Configuration Protocol). DHCP 
allows both manual IP address assignment and automatic assignment. It is described in RFCs 
2131 and 2132. In most systems, it has largely replaced RARP and BOOTP. 


Like RARP and BOOTP, DHCP is based on the idea of a special server that assigns IP addresses 
to hosts asking for one. This server need not be on the same LAN as the requesting host. Since 
the DHCP server may not be reachable by broadcasting, a DHCP relay agent is needed on 
each LAN, as shown in Fig. 5-63. 


Figure 5-63. Operation of DHCP. 


To find its IP address, a newly-booted machine broadcasts a DHCP DISCOVER packet. The 
DHCP relay agent on its LAN intercepts all DHCP broadcasts. When it finds a DHCP DISCOVER 
packet, it sends the packet as a unicast packet to the DHCP server, possibly on a distant 
network. The only piece of information the relay agent needs is the IP address of the DHCP 
server. 


An issue that arises with automatic assignment of IP addresses from a pool is how long an IP 
address should be allocated. If a host leaves the network and does not return its IP address to 
the DHCP server, that address will be permanently lost. After a period of time, many addresses 
may be lost. To prevent that from happening, IP address assignment may be for a fixed period 
of time, a technique called leasing. Just before the lease expires, the host must ask the DHCP 
server for a renewal. If it fails to make a request or the request is denied, the host may no longer use 
the IP address it was given earlier. 
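
Leasing can be illustrated with a small address pool. The sketch below is not the DHCP protocol itself: the LeasePool class and its method names are hypothetical, and the current time is passed in explicitly so that the behavior is easy to follow.

```python
class LeasePool:
    """Toy address pool with fixed-length leases (illustrative only)."""

    def __init__(self, addresses, lease_seconds):
        self.free = list(addresses)
        self.lease_seconds = lease_seconds
        self.leases = {}  # address -> (client hardware address, expiry time)

    def reclaim(self, now):
        # Return every expired address to the free list.
        for addr, (_, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[addr]
                self.free.append(addr)

    def allocate(self, client, now):
        # Hand out a free address, or None if the pool is exhausted.
        self.reclaim(now)
        if not self.free:
            return None
        addr = self.free.pop(0)
        self.leases[addr] = (client, now + self.lease_seconds)
        return addr

    def renew(self, client, addr, now):
        # A renewal succeeds only if the client still holds an unexpired lease.
        self.reclaim(now)
        holder = self.leases.get(addr)
        if holder is None or holder[0] != client:
            return False
        self.leases[addr] = (client, now + self.lease_seconds)
        return True
```

A host that vanishes without returning its address simply stops renewing. Once the lease expires, the next call to allocate or renew puts the address back in the pool, so nothing is lost permanently.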


5.3.4 OSPF—The Interior Gateway Routing Protocol 


We have now finished our study of Internet control protocols. It is time to move on to the next 
topic: routing in the Internet. As we mentioned earlier, the Internet is made up of a large 
number of autonomous systems. Each AS is operated by a different organization and can use 
its own routing algorithm inside. For example, the internal networks of companies X, Y, and Z 
are usually seen as three ASes if all three are on the Internet. All three may use different 
routing algorithms internally. Nevertheless, having standards, even for internal routing, 
simplifies the implementation at the boundaries between ASes and allows reuse of code. In 
this section we will study routing within an AS. In the next one, we will look at routing between 
ASes. A routing algorithm within an AS is called an interior gateway protocol; an algorithm 
for routing between ASes is called an exterior gateway protocol. 


The original Internet interior gateway protocol was a distance vector protocol (RIP) based on 
the Bellman-Ford algorithm inherited from the ARPANET. It worked well in small systems, but 
less well as ASes got larger. It also suffered from the count-to-infinity problem and generally 
slow convergence, so it was replaced in May 1979 by a link state protocol. In 1988, the 
Internet Engineering Task Force began work on a successor. That successor, called OSPF 
(Open Shortest Path First), became a standard in 1990. Most router vendors now support it, 
and it has become the main interior gateway protocol. Below we will give a sketch of how 
OSPF works. For the complete story, see RFC 2328. 


Given the long experience with other routing protocols, the group designing the new protocol 
had a long list of requirements that had to be met. First, the algorithm had to be published in 
the open literature, hence the "O" in OSPF. A proprietary solution owned by one company 
would not do. Second, the new protocol had to support a variety of distance metrics, including 
physical distance, delay, and so on. Third, it had to be a dynamic algorithm, one that adapted 
to changes in the topology automatically and quickly. 


Fourth, and new for OSPF, it had to support routing based on type of service. The new protocol 
had to be able to route real-time traffic one way and other traffic a different way. The IP 
protocol has a Type of Service field, but no existing routing protocol used it. This field was 
included in OSPF but still nobody used it, and it was eventually removed. 


Fifth, and related to the above, the new protocol had to do load balancing, splitting the load 
over multiple lines. Most previous protocols sent all packets over the best route. The second- 
best route was not used at all. In many cases, splitting the load over multiple lines gives better 
performance. 


Sixth, support for hierarchical systems was needed. By 1988, the Internet had grown so large 
that no router could be expected to know the entire topology. The new routing protocol had to 
be designed so that no router would have to. 


Seventh, some modicum of security was required to prevent fun-loving students from spoofing 
routers by sending them false routing information. Finally, provision was needed for dealing 
with routers that were connected to the Internet via a tunnel. Previous protocols did not 
handle this well. 


OSPF supports three kinds of connections and networks: 


1. Point-to-point lines between exactly two routers. 
2. Multiaccess networks with broadcasting (e.g., most LANs). 
3. Multiaccess networks without broadcasting (e.g., most packet-switched WANs). 


A multiaccess network is one that can have multiple routers on it, each of which can directly 
communicate with all the others. All LANs and WANs have this property. Figure 5-64(a) shows 
an AS containing all three kinds of networks. Note that hosts do not generally play a role in 
OSPF. 


Figure 5-64. (a) An autonomous system. (b) A graph representation of (a). 


OSPF operates by abstracting the collection of actual networks, routers, and lines into a 
directed graph in which each arc is assigned a cost (distance, delay, etc.). It then computes 
the shortest path based on the weights on the arcs. A serial connection between two routers is 
represented by a pair of arcs, one in each direction. Their weights may be different. A 
multiaccess network is represented by a node for the network itself plus a node for each 
router. The arcs from the network node to the routers have weight 0 and are omitted from the 
graph. 


Figure 5-64(b) shows the graph representation of the network of Fig. 5-64(a). Weights are 
symmetric, unless marked otherwise. What OSPF fundamentally does is represent the actual 
network as a graph like this and then compute the shortest path from every router to every 
other router. 
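
The computation can be sketched with Dijkstra's algorithm on a directed graph. The topology below is hypothetical (it is not the one in Fig. 5-64), and the function name shortest_paths is ours. R1 and R2 are joined by a serial line whose two directions have different costs, and N1 is the node for a multiaccess network, so the arcs from N1 to its routers have weight 0.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm on a directed graph given as {node: {neighbor: cost}}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


# Hypothetical AS: one asymmetric serial line plus one multiaccess network.
graph = {
    "R1": {"R2": 1, "N1": 3},
    "R2": {"R1": 5, "N1": 2},
    "R3": {"N1": 1},
    "N1": {"R1": 0, "R2": 0, "R3": 0},  # network-to-router arcs cost 0
}

from_r1 = shortest_paths(graph, "R1")
from_r3 = shortest_paths(graph, "R3")
```

Each router runs the same computation with itself as the source, which is why every router in an area needs the same link state database.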


Many of the ASes in the Internet are themselves large and nontrivial to manage. OSPF allows 
them to be divided into numbered areas, where an area is a network or a set of contiguous 
networks. Areas do not overlap but need not be exhaustive, that is, some routers may belong 
to no area. An area is a generalization of a subnet. Outside an area, its topology and details 
are not visible. 


Every AS has a backbone area, called area 0. All areas are connected to the backbone, 
possibly by tunnels, so it is possible to go from any area in the AS to any other area in the AS 
via the backbone. A tunnel is represented in the graph as an arc and has a cost. Each router 
that is connected to two or more areas is part of the backbone. As with other areas, the 
topology of the backbone is not visible outside the backbone. 


Within an area, each router has the same link state database and runs the same shortest path 
algorithm. Its main job is to calculate the shortest path from itself to every other router in the 
area, including the router that is connected to the backbone, of which there must be at least 
one. A router that connects to two areas needs the databases for both areas and must run the 
shortest path algorithm for each one separately. 


During normal operation, three kinds of routes may be needed: intra-area, interarea, and 
inter-AS. Intra-area routes are the easiest, since the source router already knows the shortest 
path to the destination router. Interarea routing always proceeds in three steps: go from the 
source to the backbone; go across the backbone to the destination area; go to the destination. 
This algorithm forces a star configuration on OSPF with the backbone being the hub and the 
other areas being spokes. Packets are routed from source to destination "as is." They are not 
encapsulated or tunneled, unless going to an area whose only connection to the backbone is a 
tunnel. Figure 5-65 shows part of the Internet with ASes and areas. 


Figure 5-65. The relation between ASes, backbones, and areas in OSPF. 


OSPF distinguishes four classes of routers: 


1. Internal routers are wholly within one area. 
2. Area border routers connect two or more areas. 
3. Backbone routers are on the backbone. 
4. AS boundary routers talk to routers in other ASes. 


These classes are allowed to overlap. For example, all the border routers are automatically 
part of the backbone. In addition, a router that is in the backbone but not part of any other 
area is also an internal router. Examples of all four classes of routers are illustrated in 
Fig. 5-65. 
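
Because the classes overlap, it can help to see the rules written out. The following sketch derives a router's classes from the set of areas it is attached to, using only the rules stated above. The function name and the talks_to_other_as flag are our own illustrative choices.

```python
BACKBONE = 0  # area 0 is the backbone


def router_classes(areas, talks_to_other_as=False):
    """Return the set of OSPF router classes for a router attached to `areas`."""
    areas = set(areas)
    classes = set()
    if len(areas) == 1:
        classes.add("internal")        # wholly within one area
    if len(areas) >= 2:
        classes.add("area border")     # connects two or more areas
    # Every router connected to two or more areas is part of the backbone.
    if BACKBONE in areas or len(areas) >= 2:
        classes.add("backbone")
    if talks_to_other_as:
        classes.add("AS boundary")
    return classes
```

For example, a router attached only to area 0 comes out as both an internal router and a backbone router, as the text notes.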


When a router boots, it sends HELLO messages on all of its point-to-point lines and multicasts 
them on LANs to the group consisting of all the other routers. On WANs, it needs some 
configuration information to know who to contact. From the responses, each router learns who 
its neighbors are. Routers on the same LAN are all neighbors. 


OSPF works by exchanging information between adjacent routers, which is not the same as 
between neighboring routers. In particular, it is inefficient to have every router on a LAN talk 
to every other router on the LAN. To avoid this situation, one router is elected as the 
designated router. It is said to be adjacent to all the other routers on its LAN, and 
exchanges information with them. Neighboring routers that are not adjacent do not exchange 
information with each other. A backup designated router is always kept up to date to ease the 
transition should the primary designated router crash and need to be replaced immediately. 


During normal operation, each router periodically floods LINK STATE UPDATE messages to 
each of its adjacent routers. This message gives its state and provides the costs used in the 
topological database. The flooding messages are acknowledged, to make them reliable. Each 
message has a sequence number, so a router can see whether an incoming LINK STATE 
UPDATE is older or newer than what it currently has. Routers also send these messages when 
a line goes up or down or its cost changes. 
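
The role of the sequence number can be shown with a very small sketch. It assumes that sequence numbers simply increase and ignores wraparound and aging, which a real implementation has to handle. The function name and the database layout are hypothetical.

```python
def accept_update(database, router_id, seq, costs):
    """Install a link state update only if it is newer than what we hold.

    database maps router_id -> (sequence number, costs).
    Returns True if the entry was installed (and should be flooded onward),
    False if it is a duplicate or older than the stored copy.
    """
    current = database.get(router_id)
    if current is not None and seq <= current[0]:
        return False
    database[router_id] = (seq, costs)
    return True
```

Discarding duplicates and stale updates is what stops flooding from circulating the same message forever.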


DATABASE DESCRIPTION messages give the sequence numbers of all the link state entries 
currently held by the sender. By comparing its own values with those of the sender, the 
receiver can determine who has the most recent values. These messages are used when a line 
is brought up. 
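
As a rough sketch of this comparison, suppose each side summarizes its database as a mapping from router identifier to sequence number. The mapping and the function name are assumptions made for illustration, with sequence numbers taken to be nonnegative and increasing. The receiver can then list the entries for which the sender is more up to date:

```python
def entries_to_request(own_summary, peer_summary):
    """Compare {router_id: sequence number} summaries.

    Returns the router ids for which the peer holds a newer entry,
    including entries we do not have at all.
    """
    return sorted(
        router_id
        for router_id, peer_seq in peer_summary.items()
        if own_summary.get(router_id, -1) < peer_seq
    )
```

Running the same comparison in the other direction tells the partner what it is missing.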


Either partner can request link state information from the other one by using LINK STATE 
REQUEST messages. The result of this algorithm is that each pair of adjacent routers checks to 
see who has the most recent data, and new information is spread throughout the area this 
way. All these messages are sent as raw IP packets. The five kinds of messages are 
summarized in Fig. 5-66. 


Figure 5-66. The five types of OSPF messages. 


Message type            Description 
Hello                   Used to discover who the neighbors are 
Link state update       Provides the sender's costs to its neighbors 
Link state ack          Acknowledges link state update 
Database description    Announces which updates the sender has 
Link state request      Requests information from the partner 


Finally, we can put all the pieces together. Using flooding, each router informs all the other 
routers in its area of its neighbors and costs. This information allows each router to construct 
the graph for its area(s) and compute the shortest path. The backbone area does this too. In 
addition, the backbone routers accept information from the area border routers in order to 
compute the best route from each backbone router to every other router. This information is 
propagated back to the area border routers, which advertise it within their areas. Using this 
information, a router about to send an interarea packet can select the best exit router to the 
backbone. 


5.3.5 BGP—The Exterior Gateway Routing Protocol 


Within a single AS, the recommended routing protocol is OSPF (although it is certainly not the 
only one in use). Between ASes, a different protocol, BGP (Border Gateway Protocol), is 
used. A different protocol is needed between ASes because the goals of an interior gateway 
protocol and an exterior gateway protocol are not the same. All an interior gateway protocol 
has to do is move packets as efficiently as possible from the source to the destination. It does 
not have to worry about politics. 


Exterior gateway protocol routers have to worry about politics a great deal (Metz, 2001). For 
example, a corporate AS might want the ability to send packets to any Internet site and 
receive packets from any Internet site. However, it might be unwilling to carry transit packets 
originating in a foreign AS and ending in a different foreign AS, even if its own AS was on the 
shortest path between the two foreign ASes ("That's their problem, not ours"). On the other 
hand, it might be willing to carry transit traffic for its neighbors or even for specific other ASes 
that paid it for this service. Telephone companies, for example, might be happy to act as a 
carrier for their customers, but not for others. Exterior gateway protocols in general, and BGP 
in particular, have been designed to allow many kinds of routing policies to be enforced in the 
inter-AS traffic. 


Typical policies involve political, security, or economic considerations. A few examples of 
routing constraints are: 


1. No transit traffic through certain ASes. 
2. Never put Iraq on a route starting at the Pentagon. 
3. Do not use the United States to get from British Columbia to Ontario. 
4. Only transit Albania if there is no alternative to the destination. 
5. Traffic starting or ending at IBM should not transit Microsoft. 


Policies are typically manually configured into each BGP router (or included using some kind of 
script). They are not part of the protocol itself. 


From the point of view of a BGP router, the world consists of ASes and the lines connecting 
them. Two ASes are considered connected if there is a line between a border router in each 
one. Given BGP's special interest in transit traffic, networks are grouped into one of three 
categories. The first category is the stub networks, which have only one connection to the 
BGP graph. These cannot be used for transit traffic because there is no one on the other side. 
Then come the multiconnected networks. These could be used for transit traffic, except that 
they refuse. Finally, there are the transit networks, such as backbones, which are willing to 
handle third-party packets, possibly with some restrictions, and usually for pay. 


Pairs of BGP routers communicate with each other by establishing TCP connections. Operating 
this way provides reliable communication and hides all the details of the network being passed 
through. 


BGP is fundamentally a distance vector protocol, but quite different from most others such as 
RIP. Instead of maintaining just the cost to each destination, each BGP router keeps track of 
the path used. Similarly, instead of periodically giving each neighbor its estimated cost to each 
possible destination, each BGP router tells its neighbors the exact path it is using. 


As an example, consider the BGP routers shown in Fig. 5-67(a). In particular, consider F's 
routing table. Suppose that it uses the path FGCD to get to D. When the neighbors give it 
routing information, they provide their complete paths, as shown in Fig. 5-67(b) (for 
simplicity, only destination D is shown here). 


Figure 5-67. (a) A set of BGP routers. (b) Information sent to F. 


After all the paths come in from the neighbors, F examines them to see which is the best. It 
quickly discards the paths from I and E, since these paths pass through F itself. The choice is 
then between using B and G. Every BGP router contains a module that examines routes to a 
given destination and scores them, returning a number for the "distance" to that destination 
for each route. Any route violating a policy constraint automatically gets a score of infinity. The 
router then adopts the route with the shortest distance. The scoring function is not part of the 
BGP protocol and can be any function the system managers want. 


BGP easily solves the count-to-infinity problem that plagues other distance vector routing 
algorithms. For example, suppose G crashes or the line FG goes down. F then receives routes 
from its three remaining neighbors. These routes are BCD, IFGCD, and EFGCD. It can 
immediately see that the two latter routes are pointless, since they pass through F itself, so it 
chooses FBCD as its new route. Other distance vector algorithms often make the wrong choice 
because they cannot tell which of their neighbors have independent routes to the destination 
and which do not. The definition of BGP is in RFCs 1771 to 1774. 
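
F's decision procedure can be sketched as follows. Paths are written as strings of router names, as in the text. The function names are ours, and the scoring function shown (path length, with infinity for any path through a forbidden router) is just one hypothetical policy, since BGP leaves the scoring function to the system managers. G's advertisement GCD is inferred from the statement that F reaches D via FGCD.

```python
import math


def choose_route(self_id, advertised, score):
    """Pick the best advertised path, discarding any path that already contains self_id."""
    best_path, best_score = None, math.inf
    for neighbor, path in advertised.items():
        if self_id in path:
            continue  # the neighbor's route passes through us, so it is useless
        candidate = self_id + path
        s = score(candidate)
        if s < best_score:
            best_path, best_score = candidate, s
    return best_path


def hop_count_avoiding(forbidden):
    """Example scoring function: path length, or infinity if a forbidden router is used."""
    def score(path):
        return math.inf if any(r in path for r in forbidden) else len(path)
    return score


# Routes F receives after G crashes (from the text).
after_crash = {"B": "BCD", "I": "IFGCD", "E": "EFGCD"}
new_route = choose_route("F", after_crash, len)
```

Because every advertisement carries the full path, F rejects the routes from I and E at once instead of slowly counting to infinity. A policy constraint is handled the same way: a route with an infinite score is never adopted.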


5.3.6 Internet Multicasting 


Normal IP communication is between one sender and one receiver. However, for some 
applications it is useful for a process to be able to send to a large number of receivers 
simultaneously. Examples are updating replicated, distributed databases, transmitting stock 
quotes to multiple brokers, and handling digital conference (i.e., multiparty) telephone calls. 


IP supports multicasting, using class D addresses. Each class D address identifies a group of 
hosts. Twenty-eight bits are available for identifying groups, so over 250 million groups can 
exist at the same time. When a process sends a packet to a class D address, a best-efforts 
attempt is made to deliver it to all the members of the group addressed, but no guarantees 
are given. Some members may not get the packet. 
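
The figure of over 250 million groups follows directly from the address format. A class D address begins with the bits 1110, which leaves 28 bits for the group. The short sketch below checks the arithmetic and tests whether an address is class D. The helper name is_class_d is ours.

```python
import ipaddress

# A class D address starts with the bits 1110, leaving 32 - 4 = 28 bits for the group.
GROUP_BITS = 32 - 4
NUM_GROUPS = 2 ** GROUP_BITS  # 268,435,456


def is_class_d(address):
    """True if the top four bits of the IPv4 address are 1110."""
    return int(ipaddress.IPv4Address(address)) >> 28 == 0b1110
```

Class D therefore covers the range 224.0.0.0 through 239.255.255.255.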


Two kinds of group addresses are supported: permanent addresses and temporary ones. A 
permanent group is always there and does not have to be set up. Each permanent group has a 
permanent group address. Some examples of permanent group addresses are: 


224.0.0.1    All systems on a LAN 
224.0.0.2    All routers on a LAN 
224.0.0.5    All OSPF routers on a LAN 
224.0.0.6    All designated OSPF routers on a LAN 


Temporary groups must be created before they can be used. A process can ask its host to join 
a specific group. It can also ask its host to leave the group. When the last process on a host 
leaves a group, that group is no longer present on the host. Each host keeps track of which 
groups its processes currently belong to. 
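
The bookkeeping a host has to do is simple. The sketch below is one possible way to do it, with a hypothetical HostGroups class that records which processes belong to each group and forgets a group when its last member leaves.

```python
from collections import defaultdict


class HostGroups:
    """Tracks which multicast groups the processes on one host belong to."""

    def __init__(self):
        self.members = defaultdict(set)  # group address -> set of process ids

    def join(self, pid, group):
        self.members[group].add(pid)

    def leave(self, pid, group):
        self.members[group].discard(pid)
        if not self.members[group]:
            # Last process left: the group is no longer present on this host.
            del self.members[group]

    def report(self):
        """Groups this host would report when asked which ones it belongs to."""
        return sorted(self.members)
```

The host reports a group as long as at least one of its processes is still a member, no matter how many processes have joined.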


Multicasting is implemented by special multicast routers, which may or may not be colocated 
with the standard routers. About once a minute, each multicast router sends a hardware (i.e., 
data link layer) multicast to the hosts on its LAN (address 224.0.0.1) asking them to report 
back on the groups their processes currently belong to. Each host sends back responses for all 
the class D addresses it is interested in. 


These query and response packets use a protocol called IGMP (Internet Group 
Management Protocol), which is vaguely analogous to ICMP. It has only two kinds of 
packets: query and response, each with a simple, fixed format containing some control 
information in the first word of the payload field and a class D address in the second word. It is 
described in RFC 1112. 


Multicast routing is done using spanning trees. Each multicast router exchanges information 
with its neighbors, using a modified distance vector protocol in order for each one to construct 
a spanning tree per group covering all group members. Various optimizations are used to 
prune the tree to eliminate routers and networks not interested in particular groups. The 
protocol makes heavy use of tunneling to avoid bothering nodes not in a spanning tree. 


5.3.7 Mobile IP 


Many users of the Internet have portable computers and want to stay connected to the 
Internet when they visit a distant Internet site and even on the road in between. 
Unfortunately, the IP addressing system makes working far from home easier said than done. 


In this section we will examine the problem and the solution. A more detailed description is 
given in (Perkins, 1998a). 


The real villain is the addressing scheme itself. Every IP address contains a network number 
and a host number. For example, consider the machine with IP address 160.80.40.20/16. The 
160.80 gives the network number (8272 in decimal); the 40.20 is the host number (10260 in 
decimal). Routers all over the world have routing tables telling which line to use to get to 
network 160.80. Whenever a packet comes in with a destination IP address of the form 
160.80.xxx.yyy, it goes out on that line. 
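
The two decimal values quoted above can be checked with a little arithmetic. The host number is just the lower 16 bits of the address. The network number 8272 is what remains of the upper 16 bits after the two leading bits (10) that mark a class B address are dropped, leaving 14 bits. The variable names in this sketch are ours.

```python
import ipaddress

addr = int(ipaddress.IPv4Address("160.80.40.20"))

# With a /16 prefix, the upper 16 bits select the network and the lower 16 the host.
upper16 = addr >> 16          # 160 * 256 + 80 = 41040
host = addr & 0xFFFF          # 40 * 256 + 20  = 10260

# 160 is 10100000 in binary, so this is a class B address: the leading bits "10"
# mark the class, and the remaining 14 bits of the upper half are the network number.
network = upper16 & 0x3FFF    # 41040 - 32768 = 8272
```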


If all of a sudden, the machine with that address is carted off to some distant site, the packets 
for it will continue to be routed to its home LAN (or router). The owner will no longer get 
e-mail, and so on. Giving the machine a new IP address corresponding to its new location is 
unattractive because large numbers of people, programs, and databases would have to be 
informed of the change. 


Another approach is to have the routers use complete IP addresses for routing, instead of just 
the network. However, this strategy would require each router to have millions of table entries, 
at astronomical cost to the Internet. 


When people began demanding the ability to connect their notebook computers to the Internet 
wherever they were, IETF set up a Working Group to find a solution. The Working Group 
quickly formulated a number of goals considered desirable in any solution. The major ones 
were: 


1. Each mobile host must be able to use its home IP address anywhere. 
2. Software changes to the fixed hosts were not permitted. 
3. Changes to the router software and tables were not permitted. 
4. Most packets for mobile hosts should not make detours on the way. 
5. No overhead should be incurred when a mobile host is at home. 


The solution chosen was the one described in Sec. 5.2.8. To review it briefly, every site that 
wants to allow its users to roam has to create a home agent. Every site that wants to allow 
visitors has to create a foreign agent. When a mobile host shows up at a foreign site, it 
contacts the foreign agent there and registers. The foreign agent then contacts the user's home 
agent and gives it a care-of address, normally the foreign agent's own IP address. 


When a packet arrives at the user's home LAN, it comes in at some router attached to the LAN. 
The router then tries to locate the host in the usual way, by broadcasting an ARP packet 
asking, for example: What is the Ethernet address of 160.80.40.20? The home agent responds 
to this query by giving its own Ethernet address. The router then sends packets for 
160.80.40.20 to the home agent. It, in turn, tunnels them to the care-of address by 
encapsulating them in the payload field of an IP packet addressed to the foreign agent. The 
foreign agent then decapsulates and delivers them to the data link address of the mobile host. 
In addition, the home agent gives the care-of address to the sender, so future packets can be 
tunneled directly to the foreign agent. This solution meets all the requirements stated above. 
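
The tunneling step can be sketched as ordinary encapsulation: the original packet becomes the payload of a new packet addressed to the care-of address. The Packet and HomeAgent classes below are hypothetical, and all addresses other than 160.80.40.20 are invented for the example. Registration security and the ARP handling are left out.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: str
    dst: str
    payload: object


class HomeAgent:
    def __init__(self, address):
        self.address = address
        self.care_of = {}  # home IP address -> care-of address

    def register(self, home_ip, care_of_address):
        self.care_of[home_ip] = care_of_address

    def forward(self, packet):
        """Tunnel a packet for a roaming host; pass all others through unchanged."""
        if packet.dst in self.care_of:
            return Packet(self.address, self.care_of[packet.dst], packet)
        return packet


def decapsulate(outer):
    """What the foreign agent does: recover the original packet from the tunnel."""
    return outer.payload
```

The inner packet still carries the mobile host's home address as its destination, which is how the first goal, using the home IP address anywhere, is met without changing the routers in between.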


One small detail is probably worth mentioning. At the time the mobile host moves, the router 
probably has its (soon-to-be-invalid) Ethernet address cached. Replacing that Ethernet address 
with the home agent's is done by a trick called gratuitous ARP. This is a special, unsolicited 
message to the router that causes it to replace a specific cache entry, in this case, that of the 
mobile host about to leave. When the mobile host returns later, the same trick is used to 
update the router's cache again. 


Nothing in the design prevents a mobile host from being its own foreign agent, but that 
approach only works if the mobile host (in its capacity as foreign agent) is logically connected 
to the Internet at its current site. Also, the mobile host must be able to acquire a (temporary) 
care-of IP address to use. That IP address must belong to the LAN to which it is currently 
attached. 


The IETF solution for mobile hosts solves a number of other problems not mentioned so far. 
For example, how are agents located? The solution is for each agent to periodically broadcast 
its address and the type of services it is willing to provide (e.g., home, foreign, or both). When 
a mobile host arrives somewhere, it can just listen for these broadcasts, called 
advertisements. Alternatively, it can broadcast a packet announcing its arrival and hope that 
the local foreign agent responds to it. 


Another problem that had to be solved is what to do about impolite mobile hosts that leave 
without saying goodbye. The solution is to make registration valid only for a fixed time 
interval. If it is not refreshed periodically, it times out, so the foreign agent can clear its tables. 


Yet another issue is security. When a home agent gets a message asking it to please forward 
all of Roberta's packets to some IP address, it had better not comply unless it is convinced that 
Roberta is the source of this request, and not somebody trying to impersonate her. 
Cryptographic authentication protocols are used for this purpose. We will study such protocols 
in Chap. 8. 


A final point addressed by the Working Group relates to levels of mobility. Imagine an airplane 
with an on-board Ethernet used by the navigation and avionics computers. On this Ethernet is 
a standard router that talks to the wired Internet on the ground over a radio link. One fine 
day, some clever marketing executive gets the idea to install Ethernet connectors in all the 
arm rests so passengers with mobile computers can also plug in. 


Now we have two levels of mobility: the aircraft's own computers, which are stationary with 
respect to the Ethernet, and the passengers' computers, which are mobile with respect to it. In 
addition, the on-board router is mobile with respect to routers on the ground. Being mobile 
with respect to a system that is itself mobile can be handled using recursive tunneling. 


5.3.8 IPv6 


While CIDR and NAT may buy a few more years' time, everyone realizes that the days of IP in 
its current form (IPv4) are numbered. In addition to these technical problems, another issue 
looms in the background. In its early years, the Internet was largely used by universities, 
high-tech industry, and the U.S. Government (especially the Dept. of Defense). With the 
explosion of interest in the Internet starting in the mid-1990s, it began to be used by a 
different group of people, especially people with different requirements. For one thing, 
numerous people with wireless portables use it to keep in contact with their home bases. For 
another, with the impending convergence of the computer, communication, and entertainment 
industries, it may not be that long before every telephone and television set in the world is an 
Internet node, producing a billion machines being used for audio and video on demand. Under 
these circumstances, it became apparent that IP had to evolve and become more flexible. 


Seeing these problems on the horizon, in 1990, IETF started work on a new version of IP, one 
which would never run out of addresses, would solve a variety of other problems, and be more 
flexible and efficient as well. Its major goals were: 


1. Support billions of hosts, even with inefficient address space allocation. 
2. Reduce the size of the routing tables. 
3. Simplify the protocol, to allow routers to process packets faster. 
4. Provide better security (authentication and privacy) than current IP. 
5. Pay more attention to type of service, particularly for real-time data. 
6. Aid multicasting by allowing scopes to be specified. 
7. Make it possible for a host to roam without changing its address. 
8. Allow the protocol to evolve in the future. 
9. Permit the old and new protocols to coexist for years. 


To develop a protocol that met all these requirements, IETF issued a call for proposals and 
discussion in RFC 1550. Twenty-one responses were received, not all of them full proposals. By 
December 1992, seven serious proposals were on the table. They ranged from making minor 
patches to IP, to throwing it out altogether and replacing it with a completely different protocol. 


One proposal was to run TCP over CLNP, which, with its 160-bit addresses, would have 
provided enough address space forever and would have unified two major network layer 
protocols. However, many people felt that this would have been an admission that something 
in the OSI world was actually done right, a statement considered Politically Incorrect in 
Internet circles. CLNP was patterned closely on IP, so the two are not really that different. In 
fact, the protocol ultimately chosen differs from IP far more than CLNP does. Another strike 
against CLNP was its poor support for service types, something required to transmit 
multimedia efficiently. 


Three of the better proposals were published in IEEE Network (Deering, 1993; Francis, 1993; 
and Katz and Ford, 1993). After much discussion, revision, and jockeying for position, a 
modified combined version of the Deering and Francis proposals, by now called SIPP (Simple 
Internet Protocol Plus) was selected and given the designation IPv6. 


IPv6 meets the goals fairly well. It maintains the good features of IP, discards or deemphasizes 
the bad ones, and adds new ones where needed. In general, IPv6 is not compatible with IPv4, 
but it is compatible with the other auxiliary Internet protocols, including TCP, UDP, ICMP, 
IGMP, OSPF, BGP, and DNS, sometimes with small modifications being required (mostly to deal 
with longer addresses). The main features of IPv6 are discussed below. More information 
about it can be found in RFCs 2460 through 2466. 


First and foremost, IPv6 has longer addresses than IPv4. They are 16 bytes long, which solves 
the problem that IPv6 set out to solve: provide an effectively unlimited supply of Internet 
addresses. We will have more to say about addresses shortly. 


The second major improvement of IPv6 is the simplification of the header. It contains only 
seven fields (versus 13 in IPv4). This change allows routers to process packets faster and thus 
improve throughput and delay. We will discuss the header shortly, too. 


The third major improvement was better support for options. This change was essential with 
the new header because fields that previously were required are now optional. In addition, the 
way options are represented is different, making it simple for routers to skip over options not 
intended for them. This feature speeds up packet processing time. 


A fourth area in which IPv6 represents a big advance is in security. IETF had its fill of 
newspaper stories about precocious 12-year-olds using their personal computers to break into 
banks and military bases all over the Internet. There was a strong feeling that something had 
to be done to improve security. Authentication and privacy are key features of the new IP. 
These were later retrofitted to IPv4, however, so in the area of security the differences are not 
so great any more. 


Finally, more attention has been paid to quality of service. Various half-hearted efforts have 
been made in the past, but now with the growth of multimedia on the Internet, the sense of 
urgency is greater. 


The Main IPv6 Header 


The IPv6 header is shown in Fig. 5-68. The Version field is always 6 for IPv6 (and 4 for IPv4). 
During the transition period from IPv4, which will probably take a decade, routers will be able 
to examine this field to tell what kind of packet they have. As an aside, making this test 
wastes a few instructions in the critical path, so many implementations are likely to try to 
avoid it by using some field in the data link header to distinguish IPv4 packets from IPv6 
packets. In this way, packets can be passed to the correct network layer handler directly. 
However, having the data link layer be aware of network packet types completely violates the 
design principle that each layer should not be aware of the meaning of the bits given to it from 
the layer above. The discussions between the "Do it right" and "Make it fast" camps will no 
doubt be lengthy and vigorous. 


Figure 5-68. The IPv6 fixed header (required). 


The Traffic class field is used to distinguish between packets with different real-time delivery 
requirements. A field designed for this purpose has been in IP since the beginning, but it has 
been only sporadically implemented by routers. Experiments are now underway to determine 
how best it can be used for multimedia delivery. 


The Flow label field is also still experimental but will be used to allow a source and destination 
to set up a pseudoconnection with particular properties and requirements. For example, a 
stream of packets from one process on a certain source host to a certain process on a certain 
destination host might have stringent delay requirements and thus need reserved bandwidth. 
The flow can be set up in advance and given an identifier. When a packet with a nonzero Flow 
label shows up, all the routers can look it up in internal tables to see what kind of special 
treatment it requires. In effect, flows are an attempt to have it both ways: the flexibility of a 
datagram subnet and the guarantees of a virtual-circuit subnet. 


Each flow is designated by the source address, destination address, and flow number, so many 
flows may be active at the same time between a given pair of IP addresses. Also, in this way, 
even if two flows coming from different hosts but with the same flow label pass through the 
same router, the router will be able to tell them apart using the source and destination 
addresses. It is expected that flow labels will be chosen randomly, rather than assigned 
sequentially starting at 1, so routers are expected to hash them. 
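
The lookup described above can be sketched as a hash table keyed by the (source address, destination address, flow label) triple. This is only an illustration under stated assumptions: the class name FlowTable, its method names, and the treatment strings are hypothetical, and the 20-bit width of the Flow label field is taken from RFC 2460.

```python
import ipaddress


class FlowTable:
    """Toy per-router flow table (hypothetical, for illustration only)."""

    def __init__(self):
        # A dict is a hash table, which suits randomly chosen flow labels.
        self._flows = {}

    @staticmethod
    def _key(src, dst, label):
        return (ipaddress.IPv6Address(src), ipaddress.IPv6Address(dst), label)

    def set_up(self, src, dst, label, treatment):
        # The Flow label field is 20 bits wide; 0 means "not part of a flow".
        if not 0 < label < 2 ** 20:
            raise ValueError("flow label must be in 1 .. 2^20 - 1")
        self._flows[self._key(src, dst, label)] = treatment

    def lookup(self, src, dst, label):
        # Packets with a zero label get no special treatment.
        if label == 0:
            return None
        return self._flows.get(self._key(src, dst, label))


table = FlowTable()
# Two different hosts happen to pick the same label; the addresses keep them apart.
table.set_up("8000::1", "8000::2", 0x12345, "reserved bandwidth")
table.set_up("8000::3", "8000::2", 0x12345, "best effort")
```

Because the source and destination addresses are part of the key, the two flows above never collide even though their labels are identical.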


The Payload length field tells how many bytes follow the 40-byte header of Fig. 5-68. The 
name was changed from the IPv4 Total length field because the meaning was changed slightly: 
the 40 header bytes are no longer counted as part of the length (as they used to be). 


The Next header field lets the cat out of the bag. The reason the header could be simplified is 
that there can be additional (optional) extension headers. This field tells which of the 
(currently) six extension headers, if any, follow this one. If this header is the last IP header, 
the Next header field tells which transport protocol handler (e.g., TCP, UDP) to pass the packet 
to. 


The Hop limit field is used to keep packets from living forever. It is, in practice, the same as 
the Time to live field in IPv4, namely, a field that is decremented on each hop. In theory, in 
IPv4 it was a time in seconds, but no router used it that way, so the name was changed to 
reflect the way it is actually used. 


Next come the Source address and Destination address fields. Deering's original proposal, SIP, 
used 8-byte addresses, but during the review process many people felt that with 8-byte 
addresses IPv6 would run out of addresses within a few decades, whereas with 16-byte 
addresses it would never run out. Other people argued that 16 bytes was overkill, whereas still 
others favored using 20-byte addresses to be compatible with the OSI datagram protocol. Still 
another faction wanted variable-sized addresses. After much debate, it was decided that 
fixed-length 16-byte addresses were the best compromise. 
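
To make the field layout concrete, here is a minimal sketch of parsing the 40-byte fixed header. It assumes the bit widths given in RFC 2460 (4-bit Version, 8-bit Traffic class, 20-bit Flow label, 16-bit Payload length, 8-bit Next header, 8-bit Hop limit, then two 16-byte addresses); the names IPv6Header and parse_ipv6_header are invented for this example.

```python
import ipaddress
import struct
from collections import namedtuple

IPv6Header = namedtuple(
    "IPv6Header",
    "version traffic_class flow_label payload_length next_header hop_limit src dst",
)


def parse_ipv6_header(data):
    """Parse the 40-byte fixed IPv6 header found at the start of `data`."""
    if len(data) < 40:
        raise ValueError("need at least 40 bytes")
    # First 32-bit word: 4-bit version, 8-bit traffic class, 20-bit flow label.
    word, payload_length, next_header, hop_limit = struct.unpack("!IHBB", data[:8])
    version = word >> 28
    if version != 6:
        raise ValueError("not an IPv6 packet")
    return IPv6Header(
        version=version,
        traffic_class=(word >> 20) & 0xFF,
        flow_label=word & 0xFFFFF,
        payload_length=payload_length,   # bytes following the 40-byte header
        next_header=next_header,         # extension header or transport protocol
        hop_limit=hop_limit,
        src=ipaddress.IPv6Address(bytes(data[8:24])),
        dst=ipaddress.IPv6Address(bytes(data[24:40])),
    )


# Build a sample header: flow label 0xABCDE, 20 payload bytes, TCP (6), hop limit 64.
sample = (
    struct.pack("!IHBB", (6 << 28) | 0xABCDE, 20, 6, 64)
    + ipaddress.IPv6Address("8000::123:4567:89AB:CDEF").packed
    + ipaddress.IPv6Address("8000::1").packed
)
header = parse_ipv6_header(sample)
```

Since the header has a fixed length, the parser needs no length field and no loop, which is part of why routers can process it quickly.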


A new notation has been devised for writing 16-byte addresses. They are written as eight 
groups of four hexadecimal digits with colons between the groups, like this: 


8000:0000:0000:0000:0123:4567:89AB:CDEF 


Since many addresses will have many zeros inside them, three optimizations have been 
authorized. First, leading zeros within a group can be omitted, so 0123 can be written as 123. 
Second, one or more groups of 16 zero bits can be replaced by a pair of colons. Thus, the 
above address now becomes 


8000::123:4567:89AB:CDEF 


Finally, IPv4 addresses can be written as a pair of colons and an old dotted decimal number, 
for example 


::192.31.20.46 
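
These notational shortcuts can be tried out with the ipaddress module in the Python standard library. The sketch below is only a demonstration of the notation; note that Python prints the hexadecimal digits in lower case.

```python
import ipaddress

addr = ipaddress.IPv6Address("8000:0000:0000:0000:0123:4567:89AB:CDEF")

# Full form: eight groups of four hexadecimal digits.
print(addr.exploded)

# Short form: leading zeros dropped and the run of zero groups replaced by "::".
print(addr.compressed)

# An IPv4 address written after a pair of colons fills the low-order 32 bits.
mixed = ipaddress.IPv6Address("::192.31.20.46")
v4 = ipaddress.IPv4Address("192.31.20.46")
print(int(mixed) == int(v4))
```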


Perhaps it is unnecessary to be so explicit about it, but there are a lot of 16-byte addresses. 
Specifically, there are 2^128 of them, which is approximately 3 x 10^38. If the entire earth, land 
and water, were covered with computers, IPv6 would allow 7 x 10^23 IP addresses per square 
meter. Students of chemistry will notice that this number is larger than Avogadro's number. 
While it was not the intention to give every molecule on the surface of the earth its own IP 
address, we are not that far off. 
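
The arithmetic behind these figures is easy to check. The sketch below assumes a surface area for the earth of roughly 5.1 x 10^14 square meters and the usual value of Avogadro's number; both constants are supplied here for illustration and do not come from the text.

```python
EARTH_SURFACE_M2 = 5.1e14   # assumed: about 510 million square kilometers
AVOGADRO = 6.022e23         # assumed: molecules per mole

total = 2 ** 128                      # number of distinct 16-byte addresses
per_m2 = total / EARTH_SURFACE_M2     # addresses per square meter of surface

print(f"{total:.1e} addresses in total")
print(f"{per_m2:.1e} addresses per square meter")
print("more than Avogadro's number:", per_m2 > AVOGADRO)
```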


In practice, the address space will not be used efficiently, just as the telephone number 
address space is not (the area code for Manhattan, 212, is nearly full, but that for Wyoming, 
307, is nearly empty). In RFC 3194, Durand and Huitema calculated that, using the allocation 
of telephone numbers as a guide, even in the most pessimistic scenario there will still be well 
over 1000 IP addresses per square meter of the entire earth's surface (land and water). In any 
likely scenario, there will be trillions of them per square meter. In short, it seems unlikely that 
we will run out in the foreseeable future. 


It is instructive to compare the IPv4 header (Fig. 5-53) with the IPv6 header (Fig. 5-68) to see 
what has been left out in IPv6. The IHL field is gone because the IPv6 header has a fixed 
length. The Protocol field was taken out because the Next header field tells what follows the 
last IP header (e.g., a UDP or TCP segment). 


All the fields relating to fragmentation were removed because IPv6 takes a different approach 
to fragmentation. To start with, all IPv6-conformant hosts are expected to dynamically 
determine the datagram size to use. This rule makes fragmentation less likely to occur in the 
first place. Also, the minimum datagram size that must be supported has been raised from 576 
to 1280 bytes to allow 1024 bytes of data and many headers. In addition, when a host sends an 
IPv6 packet that is too large, instead of 
fragmenting it, the router that is unable to forward it sends back an error message. This 
message tells the host to break up all future packets to that destination. Having the host send 
packets that are the right size in the first place is ultimately much more efficient than having 
the routers fragment them on the fly. 


Finally, the Checksum field is gone because calculating it greatly reduces performance. With 
the reliable networks now used, combined with the fact that the data link layer and transport 
layers normally have their own checksums, the value of yet another checksum was not worth 
the performance price it exacted. Removing all these features has resulted in a lean and 
mean network layer protocol. Thus, the goal of IPv6—a fast, yet flexible, protocol with plenty 
of address space—has been met by this design. 


Extension Headers 


Some of the missing IPv4 fields are occasionally still needed, so IPv6 has introduced the 
concept of an (optional) extension header. These headers can be supplied to provide extra 
information, but encoded in an efficient way. Six kinds of extension headers are defined at 
present, as listed in Fig. 5-69. Each one is optional, but if more than one is present, they must 
appear directly after the fixed header, and preferably in the order listed. 


Figure 5-69. IPv6 extension headers. 


Extension header             Description 
Hop-by-hop options           Miscellaneous information for routers 
Destination options          Additional information for the destination 
Routing                      Loose list of routers to visit 
Fragmentation                Management of datagram fragments 
Authentication               Verification of the sender's identity 
Encrypted security payload   Information about the encrypted contents 


Some of the headers have a fixed format; others contain a variable number of variable-length 
fields. For these, each item is encoded as a (Type, Length, Value) tuple. The Type is a 1-byte 
field telling which option this is. The Type values have been chosen so that the first 2 bits tell 
routers that do not know how to process the option what to do. The choices are: skip the 
option; discard the packet; discard the packet and send back an ICMP packet; and the same as 
the previous one, except do not send ICMP packets for multicast addresses (to prevent one 
bad multicast packet from generating millions of ICMP reports). 


The Length is also a 1-byte field. It tells how long the value is (0 to 255 bytes). The Value is 
any information required, up to 255 bytes. 
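
A (Type, Length, Value) area of this kind can be walked with a short loop. The following is a sketch, not a complete implementation: the function names are invented here, and it assumes one detail not stated above, namely that an option of Type 0 is a single padding byte with no Length or Value (the Pad1 option of RFC 2460).

```python
# Meaning of the two high-order bits of the option Type, as listed in the text.
ACTIONS = {
    0b00: "skip option",
    0b01: "discard packet",
    0b10: "discard packet and send ICMP",
    0b11: "discard packet and send ICMP unless destination is multicast",
}


def parse_options(data):
    """Split an options area into a list of (type, value) pairs."""
    options = []
    i = 0
    while i < len(data):
        opt_type = data[i]
        if opt_type == 0:            # assumed Pad1: one byte, no length, no value
            i += 1
            continue
        if i + 2 > len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]         # 1-byte Length: 0 to 255
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options.append((opt_type, value))
        i += 2 + length
    return options


def action_if_unknown(opt_type):
    """What a router that does not understand this option should do."""
    return ACTIONS[opt_type >> 6]


# One pad byte, an option of type 194 with a 4-byte value, then type 1 with 2 bytes.
area = bytes([0, 194, 4, 0, 1, 0, 0, 1, 2, 0, 0])
parsed = parse_options(area)
```

Because every option carries its own length, a router can step over an option it does not understand without knowing anything about its contents.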


The hop-by-hop header is used for information that all routers along the path must examine. 
So far, one option has been defined: support of datagrams exceeding 64 KB. The format of this 
header is shown in Fig. 5-70. When it is used, the Payload length field in the fixed header is 
set to zero. 


Figure 5-70. The hop-by-hop extension header for large datagrams 
(jumbograms). Fields shown: Next header, Jumbo payload length. 


As with all extension headers, this one starts out with a byte telling what kind of header comes 
next. This byte is followed by one telling how long the hop-by-hop header is, in units of 8 
bytes, excluding the first 8 bytes, which are mandatory. All extensions begin this way. 


The next 2 bytes indicate that this option defines the datagram size (code 194) and that the 
size is a 4-byte number. The last 4 bytes give the size of the datagram. Sizes less than 65,536 
bytes are not permitted and will result in the first router discarding the packet and sending 
back an ICMP error message. Datagrams using this header extension are called jumbograms. 
The use of jumbograms is important for supercomputer applications that must transfer 
gigabytes of data efficiently across the Internet. 
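
The 8-byte layout just described can be sketched with the struct module. The function names below are invented for this example; the option code 194 and the 4-byte size come from the text, and the second byte is zero because nothing follows the mandatory first 8 bytes.

```python
import struct

JUMBO_OPTION_TYPE = 194   # option code given in the text
JUMBO_OPTION_LEN = 4      # the size is a 4-byte number


def build_jumbo_header(next_header, payload_length):
    """Return the 8-byte hop-by-hop header announcing a jumbogram."""
    if payload_length < 65536:
        raise ValueError("sizes below 65,536 bytes are not permitted")
    if payload_length >= 2 ** 32:
        raise ValueError("size must fit in 4 bytes")
    # Byte 0: next header. Byte 1: header length beyond the first 8 bytes (none).
    return struct.pack(
        "!BBBBI", next_header, 0, JUMBO_OPTION_TYPE, JUMBO_OPTION_LEN, payload_length
    )


def parse_jumbo_header(header):
    """Return (next_header, payload_length) from an 8-byte jumbo header."""
    next_header, _ext_len, opt_type, opt_len, size = struct.unpack("!BBBBI", header[:8])
    if opt_type != JUMBO_OPTION_TYPE or opt_len != JUMBO_OPTION_LEN:
        raise ValueError("not a jumbo payload option")
    return next_header, size


jumbo = build_jumbo_header(6, 1_000_000)   # next header 6 is TCP
```

A sender using this header would also set the Payload length field of the fixed header to zero, as noted above.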


The destination options header is intended for fields that need only be interpreted at the 
destination host. In the initial version of IPv6, the only options defined are null options for 
padding this header out to a multiple of 8 bytes, so initially it will not be used. It was included 
to make sure that new routing and host software can handle it, in case someone thinks of a 
destination option some day. 


The routing header lists one or more routers that must be visited on the way to the 
destination. It is very similar to the IPv4 loose source routing in that all addresses listed must 
be visited in order, but other routers not listed may be visited in between. The format of the 
routing header is shown in Fig. 5-71. 


Figure 5-71. The extension header for routing. Fields shown: Next header, Header extension 
length, Routing type, Segments left, Type-specific data. 



The first 4 bytes of the routing extension header contain four 1-byte integers. The Next header 
and Header extension length fields were described above. The Routing type field gives the 
format of the rest of the header. Type 0 says that a reserved 32-bit word follows the first 
word, followed by some number of IPv6 addresses. Other types may be invented in the future 
as needed. Finally, the Segments left field keeps track of how many of the addresses in the list 
have not yet been visited. It is decremented every time one is visited. When it hits 0, the 
packet is on its own with no more guidance about what route to follow. Usually at this point it 
is so close to the destination that the best route is obvious. 


The fragment header deals with fragmentation similarly to the way IPv4 does. The header 
holds the datagram identifier, fragment number, and a bit telling whether more fragments will 
follow. In IPv6, unlike in IPv4, only the source host can fragment a packet. Routers along the 
way may not do this. Although this change is a major philosophical break with the past, it 
simplifies the routers' work and makes routing go faster. As mentioned above, if a router is 
confronted with a packet that is too big, it discards the packet and sends an ICMP packet back 
to the source. This information allows the source host to fragment the packet into smaller 
pieces using this header and try again. 
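
Fragmentation at the source can be sketched as follows. This is a simplified model, not the on-the-wire format: each fragment is a plain tuple rather than a packed header, the function names are invented, and the code assumes that fragment positions are counted in 8-byte units (as in IPv4), so every fragment except the last carries a multiple of 8 bytes.

```python
def fragment(payload, identifier, max_fragment_data):
    """Split `payload` into (identifier, offset_units, more, data) tuples."""
    chunk = (max_fragment_data // 8) * 8     # round down to a multiple of 8 bytes
    if chunk <= 0:
        raise ValueError("fragments must carry at least 8 bytes")
    fragments = []
    for start in range(0, len(payload), chunk):
        data = payload[start : start + chunk]
        more = start + chunk < len(payload)  # the "more fragments follow" bit
        fragments.append((identifier, start // 8, more, data))
    return fragments


def reassemble(fragments):
    """Rebuild the payload at the destination; arrival order does not matter."""
    ordered = sorted(fragments, key=lambda frag: frag[1])
    return b"".join(frag[3] for frag in ordered)


payload = bytes(range(256)) * 10             # 2560 bytes of sample data
pieces = fragment(payload, identifier=42, max_fragment_data=1000)
```

In this model the source host would call fragment only after a router has reported that the original packet was too big; the routers themselves never split packets.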


The authentication header provides a mechanism by which the receiver of a packet can be sure 
of who sent it. The encrypted security payload makes it possible to encrypt the contents of a 
packet so that only the intended recipient can read it. These headers use cryptographic 
techniques to accomplish their missions. 


Controversies 


Given the open design process and the strongly-held opinions of many of the people involved, 
it should come as no surprise that many choices made for IPv6 were highly controversial, to 
say the least. We will summarize a few of these briefly below. For all the gory details, see the 
RFCs. 


We have already mentioned the argument about the address length. The result was a 
compromise: 16-byte fixed-length addresses. 


Another fight developed over the length of the Hop limit field. One camp felt strongly that 
limiting the maximum number of hops to 255 (implicit in using an 8-bit field) was a gross 
mistake. After all, paths of 32 hops are common now, and 10 years from now much longer 
paths may be common. These people argued that using a huge address size was farsighted but 
using a tiny hop count was short-sighted. In their view, the greatest sin a computer scientist 
can commit is to provide too few bits somewhere. 


The response was that arguments could be made to increase every field, leading to a bloated 
header. Also, the function of the Hop limit field is to keep packets from wandering around for a 
long time and 65,535 hops is far too long. Finally, as the Internet grows, more and more 
long-distance links will be built, making it possible to get from any country to any other country in 
half a dozen hops at most. If it takes more than 125 hops to get from the source and 
destination to their respective international gateways, something is wrong with the national 
backbones. The 8-bitters won this one. 


Another hot potato was the maximum packet size. The supercomputer community wanted 
packets in excess of 64 KB. When a supercomputer gets started transferring, it really means 
business and does not want to be interrupted every 64 KB. The argument against large 
packets is that if a 1-MB packet hits a 1.5-Mbps T1 line, that packet will tie the line up for over 
5 seconds, producing a very noticeable delay for interactive users sharing the line. A 
compromise was reached here: normal packets are limited to 64 KB, but the hop-by-hop 
extension header can be used to permit jumbograms. 
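
The delay quoted in the argument against large packets follows from dividing the packet size in bits by the line rate. The sketch below takes 1 MB as 1,000,000 bytes and the T1 rate as 1.5 Mbps, matching the round figures in the text; the function name is invented for this example.

```python
def transmission_time(packet_bytes, line_bps):
    """Seconds needed to clock a packet onto a line of the given bit rate."""
    return packet_bytes * 8 / line_bps


t_1mb = transmission_time(1_000_000, 1_500_000)   # a 1-MB packet on a T1 line
t_64kb = transmission_time(65_536, 1_500_000)     # a maximum-size normal packet

print(f"1-MB packet:  {t_1mb:.2f} s")
print(f"64-KB packet: {t_64kb:.2f} s")
```

Even a 64-KB packet holds the line for about a third of a second, which is why interactive users notice large transfers on slow links.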


A third hot topic was removing the IPv4 checksum. Some people likened this move to 
removing the brakes from a car. Doing so makes the car lighter so it can go faster, but if an 
unexpected event happens, you have a problem. 


The argument against checksums was that any application that really cares about data 
integrity has to have a transport layer checksum anyway, so having another one in IP (in 
addition to the data link layer checksum) is overkill. Furthermore, experience showed that 
computing the IP checksum was a major expense in IPv4. The antichecksum camp won this 
one, and IPv6 does not have a checksum. 


Mobile hosts were also a point of contention. If a portable computer flies halfway around the 
world, can it continue operating at the destination with the same IPv6 address, or does it have 
to use a scheme with home agents and foreign agents? Mobile hosts also introduce 
asymmetries into the routing system. It may well be the case that a small mobile computer 
can easily hear the powerful signal put out by a large stationary router, but the stationary 
router cannot hear the feeble signal put out by the mobile host. Consequently, some people 
wanted to build explicit support for mobile hosts into IPv6. That effort failed when no 
consensus could be found for any specific proposal. 


Probably the biggest battle was about security. Everyone agreed it was essential. The war was 
about where and how. First, where. The argument for putting it in the network layer is that it 
then becomes a standard service that all applications can use without any advance planning. 
The argument against it is that really secure applications generally want nothing less than 
end-to-end encryption, where the source application does the encryption and the destination 
application undoes it. With anything less, the user is at the mercy of potentially buggy network 
layer implementations over which he has no control. The response to this argument is that 
these applications can just refrain from using the IP security features and do the job 
themselves. The rejoinder to that is that the people who do not trust the network to do it right 
do not want to pay the price of slow, bulky IP implementations that have this capability, even 
if it is disabled. 


Another aspect of where to put security relates to the fact that many (but not all) countries 
have stringent export laws concerning cryptography. Some, notably France and Iraq, also 
restrict its use domestically, so that people cannot have secrets from the police. As a result, 
any IP implementation that used a cryptographic system strong enough to be of much value 
could not be exported from the United States (and many other countries) to customers 
worldwide. Having to maintain two sets of software, one for domestic use and one for export, 
is something most computer vendors vigorously oppose. 


One point on which there was no controversy is that no one expects the IPv4 Internet to be 
turned off on a Sunday morning and come back up as an IPv6 Internet Monday morning. 
Instead, isolated "islands" of IPv6 will be converted, initially communicating via tunnels. As the 
IPv6 islands grow, they will merge into bigger islands. Eventually, all the islands will merge, 
and the Internet will be fully converted. Given the massive investment in IPv4 routers 
currently deployed, the conversion process will probably take a decade. For this reason, an 
enormous amount of effort has gone into making sure that this transition will be as painless as 
possible. For more information about IPv6, see (Loshin, 1999). 


5.4 Summary 


The network layer provides services to the transport layer. It can be based on either virtual 
circuits or datagrams. In both cases, its main job is routing packets from the source to the 
destination. In virtual-circuit subnets, a routing decision is made when the virtual circuit is set 
up. In datagram subnets, it is made on every packet. 


Many routing algorithms are used in computer networks. Static algorithms include shortest 
path routing and flooding. Dynamic algorithms include distance vector routing and link state 
routing. Most actual networks use one of these. Other important routing topics are hierarchical 
routing, routing for mobile hosts, broadcast routing, multicast routing, and routing in 
peer-to-peer networks. 


Subnets can easily become congested, increasing the delay and lowering the throughput for 
packets. Network designers attempt to avoid congestion by proper design. Techniques include 
retransmission policy, caching, flow control, and more. If congestion does occur, it must be 
dealt with. Choke packets can be sent back, load can be shed, and other methods applied. 


The next step beyond just dealing with congestion is to actually try to achieve a promised 
quality of service. The methods that can be used for this include buffering at the client, traffic 
shaping, resource reservation, and admission control. Approaches that have been designed for 
good quality of service include integrated services (including RSVP), differentiated services, 
and MPLS. 


Networks differ in various ways, so when multiple networks are interconnected problems can 
occur. Sometimes the problems can be finessed by tunneling a packet through a hostile 
network, but if the source and destination networks are different, this approach fails. When 
different networks have different maximum packet sizes, fragmentation may be called for. 


The Internet has a rich variety of protocols related to the network layer. These include the 
data transport protocol, IP, but also the control protocols ICMP, ARP, and RARP, and the 
routing protocols OSPF and BGP. The Internet is rapidly running out of IP addresses, so a new 
version of IP, IPv6, has been developed. 


Chapter 6. The Transport Layer 


6.1 Elements of Transport Protocols 


The transport service is implemented by a transport protocol used between the two 
transport entities. In some ways, transport protocols resemble the data link protocols we 
studied in detail in Chap. 3. Both have to deal with error control, sequencing, and flow control, 
among other issues. 


However, significant differences between the two also exist. These differences are due to 
major dissimilarities between the environments in which the two protocols operate, as shown 
in Fig. 6-7. At the data link layer, two routers communicate directly via a physical channel, 
whereas at the transport layer, this physical channel is replaced by the entire subnet. This 
difference has many important implications for the protocols, as we shall see in this chapter. 


Figure 6-7. (a) Environment of the data link layer. (b) Environment of 
the transport layer. 


For one thing, in the data link layer, it is not necessary for a router to specify which router it 
wants to talk to—each outgoing line uniquely specifies a particular router. In the transport 
layer, explicit addressing of destinations is required. 


For another thing, the process of establishing a connection over the wire of Fig. 6-7(a) is 
simple: the other end is always there (unless it has crashed, in which case it is not there). 
Either way, there is not much to do. In the transport layer, initial connection establishment is 
more complicated, as we will see. 


Another, exceedingly annoying, difference between the data link layer and the transport layer 
is the potential existence of storage capacity in the subnet. When a router sends a frame, it 
may arrive or be lost, but it cannot bounce around for a while, go into hiding in a far corner of 
the world, and then suddenly emerge at an inopportune moment 30 sec later. If the subnet 
uses datagrams and adaptive routing inside, there is a nonnegligible probability that a packet 
may be stored for a number of seconds and then delivered later. The consequences of the 
subnet's ability to store packets can sometimes be disastrous and can require the use of 
special protocols. 


A final difference between the data link and transport layers is one of amount rather than of 
kind. Buffering and flow control are needed in both layers, but the presence of a large and 
dynamically varying number of connections in the transport layer may require a different 
approach than we used in the data link layer. In Chap. 3, some of the protocols allocate a fixed 
number of buffers to each line, so that when a frame arrives a buffer is always available. In 
the transport layer, the larger number of connections that must be managed makes the idea of 
dedicating many buffers to each one less attractive. In the following sections, we will examine 
all of these important issues and others. 


6.1.1 Addressing 


When an application (e.g., a user) process wishes to set up a connection to a remote 
application process, it must specify which one to connect to. (Connectionless transport has the 
same problem: To whom should each message be sent?) The method normally used is to 
define transport addresses to which processes can listen for connection requests. In the 
Internet, these end points are called ports. In ATM networks, they are called AAL-SAPs. We 
will use the generic term TSAP (Transport Service Access Point). The analogous end 
points in the network layer (i.e., network layer addresses) are then called NSAPs. IP 
addresses are examples of NSAPs. 


Figure 6-8 illustrates the relationship between the NSAP, TSAP and transport connection. 
Application processes, both clients and servers, can attach themselves to a TSAP to establish a 
connection to a remote TSAP. These connections run through NSAPs on each host, as shown. 
The purpose of having TSAPs is that in some networks, each computer has a single NSAP, so 
some way is needed to distinguish multiple transport end points that share that NSAP. 


Figure 6-8. TSAPs, NSAPs, and transport connections. 


A possible scenario for a transport connection is as follows. 


1. A time of day server process on host 2 attaches itself to TSAP 1522 to wait for an 
incoming call. How a process attaches itself to a TSAP is outside the networking model 
and depends entirely on the local operating system. A call such as our LISTEN might be 
used, for example. 

2. An application process on host 1 wants to find out the time-of-day, so it issues a 
CONNECT request specifying TSAP 1208 as the source and TSAP 1522 as the 
destination. This action ultimately results in a transport connection being established 
between the application process on host 1 and server 1 on host 2. 


3. The application process then sends over a request for the time. 

4. The time server process responds with the current time. 

5. The transport connection is then released. 


Note that there may well be other servers on host 2 that are attached to other TSAPs and 
waiting for incoming connections that arrive over the same NSAP. 
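
The five steps of the scenario can be acted out with TCP sockets on a single machine. This is a minimal sketch under several assumptions: both processes are modeled as threads on the loopback address, the operating system picks the port numbers (rather than the TSAPs 1522 and 1208 used in the text), and the request string "TIME?" is invented for the example.

```python
import socket
import threading
import time


def time_server(listener):
    """Steps 1, 4, and 5 on the server side: wait for a call, answer, release."""
    conn, _peer = listener.accept()
    with conn:                                    # closing releases the connection
        if conn.recv(64) == b"TIME?":
            conn.sendall(time.ctime().encode("ascii"))


def ask_time(server_addr):
    """Steps 2, 3, and 5 on the client side."""
    with socket.create_connection(server_addr, timeout=5) as sock:   # CONNECT
        sock.sendall(b"TIME?")                    # request for the time
        reply = sock.recv(64)                     # short reply, read in one call
    return reply.decode("ascii")


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)                        # do not wait forever for a caller
    listener.bind(("127.0.0.1", 0))               # port 0: let the OS pick a free port
    listener.listen(1)                            # LISTEN
    worker = threading.Thread(target=time_server, args=(listener,))
    worker.start()
    try:
        return ask_time(listener.getsockname())
    finally:
        worker.join()
        listener.close()


if __name__ == "__main__":
    print(demo())
```

Here the IP address plays the role of the NSAP and the port number plays the role of the TSAP, so other servers could listen on other ports of the same address at the same time.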


The picture painted above is fine, except we have swept one little problem under the rug: How 
does the user process on host 1 know that the time-of-day server is attached to TSAP 1522? 
One possibility is that the time-of-day server has been attaching itself to TSAP 1522 for years 
and gradually all the network users have learned this. In this model, services have stable TSAP 
addresses that are listed in files in well-known places, such as the /etc/services file on UNIX 
systems, which lists which servers are permanently attached to which ports. 
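
A file of this kind is simple to read. The sketch below parses a few sample lines written in the usual services-file layout (service name, port/protocol, optional aliases, "#" comments); the sample text is made up for the example rather than copied from any particular system, and the function name is invented.

```python
SAMPLE = """\
# service-name  port/protocol  [aliases]
daytime   13/tcp
smtp      25/tcp    mail
http      80/tcp    www     # Web server
"""


def parse_services(text):
    """Map (service name or alias, protocol) to a port number."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and blank lines
        if not line:
            continue
        name, port_proto, *aliases = line.split()
        port, proto = port_proto.split("/")
        for service in (name, *aliases):
            table[(service, proto)] = int(port)
    return table


services = parse_services(SAMPLE)
print(services[("daytime", "tcp")])
```

On a real system the same lookup is available through socket.getservbyname in the Python standard library, which consults the local services database.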


While stable TSAP addresses work for a small number of key services that never change (e.g., 
the Web server), user processes, in general, often want to talk to other user processes that 
only exist for a short time and do not have a TSAP address that is known in advance. 
Furthermore, if there are potentially many server processes, most of which are rarely used, it 
is wasteful to have each of them active and listening to a stable TSAP address all day long. In 
short, a better scheme is needed. 


One such scheme is shown in Fig. 6-9 in a simplified form. It is known as the initial 
connection protocol. Instead of every conceivable server listening at a well-known TSAP, 
each machine that wishes to offer services to remote users has a special process server that 
acts as a proxy for less heavily used servers. It listens to a set of ports at the same time, 
waiting for a connection request. Potential users of a service begin by doing a CONNECT 
request, specifying the TSAP address of the service they want. If no server is waiting for them, 
they get a connection to the process server, as shown in Fig. 6-9(a). 


Figure 6-9. How a user process in host 1 establishes a connection with 
a time-of-day server in host 2. 


After it gets the incoming request, the process server spawns the requested server, allowing it 
to inherit the existing connection with the user. The new server then does the requested work, 
while the process server goes back to listening for new requests, as shown in Fig. 6-9(b). 


While the initial connection protocol works fine for those servers that can be created as they 
are needed, there are many situations in which services do exist independently of the process 
server. A file server, for example, needs to run on special hardware (a machine with a disk) 
and cannot just be created on-the-fly when someone wants to talk to it. 


To handle this situation, an alternative scheme is often used. In this model, there exists a 
special process called a name server or sometimes a directory server. To find the TSAP 
address corresponding to a given service name, such as "time of day," a user sets up a 
connection to the name server (which listens to a well-known TSAP). The user then sends a 
message specifying the service name, and the name server sends back the TSAP address. 
Then the user releases the connection with the name server and establishes a new one with 
the desired service. 


In this model, when a new service is created, it must register itself with the name server, 
giving both its service name (typically, an ASCII string) and its TSAP. The name server records 
this information in its internal database so that when queries come in later, it will know the 
answers. 
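
The register and query steps can be sketched with an in-memory table. This toy version leaves out the network entirely: a real name server would receive these requests over a connection to its well-known TSAP, and the class and method names here are invented for the example.

```python
class NameServer:
    """Toy directory mapping service names to TSAP addresses."""

    def __init__(self):
        self._database = {}

    def register(self, service_name, tsap):
        """Called by a new service to announce where it is listening."""
        self._database[service_name] = tsap

    def lookup(self, service_name):
        """Called by a user; returns the TSAP, or None if the name is unknown."""
        return self._database.get(service_name)


name_server = NameServer()
name_server.register("time of day", 1522)     # the service registers itself
tsap = name_server.lookup("time of day")      # a user later asks for its TSAP
```

After the lookup, the user would release the connection to the name server and open a new one to the TSAP it was given.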


The function of the name server is analogous to the directory assistance operator in the 
telephone system—it provides a mapping of names onto numbers. Just as in the telephone 
system, it is essential that the address of the well-known TSAP used by the name server (or 
the process server in the initial connection protocol) is indeed well known. If you do not know 
the number of the information operator, you cannot call the information operator to find it out. 
If you think the number you dial for information is obvious, try it in a foreign country 
sometime. 


6.1.2 Connection Establishment 


Establishing a connection sounds easy, but it is actually surprisingly tricky. At first glance, it 
would seem sufficient for one transport entity to just send a CONNECTION REQUEST TPDU to 
the destination and wait for a CONNECTION ACCEPTED reply. The problem occurs when the 
network can lose, store, and duplicate packets. This behavior causes serious complications. 


Imagine a subnet that is so congested that acknowledgements hardly ever get back in time 
and each packet times out and is retransmitted two or three times. Suppose that the subnet 
uses datagrams inside and that every packet follows a different route. Some of the packets 
might get stuck in a traffic jam inside the subnet and take a long time to arrive, that is, they 
are stored in the subnet and pop out much later. 


The worst possible nightmare is as follows. A user establishes a connection with a bank, sends 
messages telling the bank to transfer a large amount of money to the account of a
not-entirely-trustworthy person, and then releases the connection. Unfortunately, each packet in
the scenario is duplicated and stored in the subnet. After the connection has been released, all 
the packets pop out of the subnet and arrive at the destination in order, asking the bank to 
establish a new connection, transfer money (again), and release the connection. The bank has 
no way of telling that these are duplicates. It must assume that this is a second, independent 
transaction, and transfers the money again. For the remainder of this section we will study the 
problem of delayed duplicates, with special emphasis on algorithms for establishing 
connections in a reliable way, so that nightmares like the one above cannot happen. 


The crux of the problem is the existence of delayed duplicates. It can be attacked in various 
ways, none of them very satisfactory. One way is to use throw-away transport addresses. In 
this approach, each time a transport address is needed, a new one is generated. When a 
connection is released, the address is discarded and never used again. This strategy makes the 
process server model of Fig. 6-9 impossible. 


Another possibility is to give each connection a connection identifier (i.e., a sequence number 
incremented for each connection established) chosen by the initiating party and put in each 
TPDU, including the one requesting the connection. After each connection is released, each 
transport entity could update a table listing obsolete connections as (peer transport entity, 
connection identifier) pairs. Whenever a connection request comes in, it could be checked 
against the table, to see if it belonged to a previously-released connection. 


Unfortunately, this scheme has a basic flaw: it requires each transport entity to maintain a 
certain amount of history information indefinitely. If a machine crashes and loses its memory, 
it will no longer know which connection identifiers have already been used. 


Instead, we need to take a different tack. Rather than allowing packets to live forever within
the subnet, we must devise a mechanism to kill off aged packets that are still hobbling about. 
If we can ensure that no packet lives longer than some known time, the problem becomes 
somewhat more manageable. 


Packet lifetime can be restricted to a known maximum using one (or more) of the following 
techniques: 


1. Restricted subnet design. 
2. Putting a hop counter in each packet. 
3. Timestamping each packet. 


The first method includes any method that prevents packets from looping, combined with 
some way of bounding congestion delay over the (now known) longest possible path. The 
second method consists of having the hop count initialized to some appropriate value and 
decremented each time the packet is forwarded. The network protocol simply discards any 
packet whose hop counter becomes zero. The third method requires each packet to bear the 
time it was created, with the routers agreeing to discard any packet older than some
agreed-upon time. This latter method requires the router clocks to be synchronized, which itself is a
nontrivial task unless synchronization is achieved external to the network, for example by 
using GPS or some radio station that broadcasts the precise time periodically. 
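
The second and third techniques are simple enough to sketch. The code below is a
minimal illustration, assuming a packet is represented as a dictionary with a "hops"
field and a "created" timestamp; these field names and both function names are
hypothetical.

```python
def forward(packet):
    """Decrement the hop counter on forwarding.

    Returns the updated packet, or None if the counter has reached zero
    and the packet must be discarded.
    """
    hops_left = packet["hops"] - 1
    if hops_left <= 0:
        return None
    return {**packet, "hops": hops_left}


def too_old(packet, now, max_age):
    """Timestamp method: True if the packet is older than the agreed limit.

    Assumes the clock that stamped the packet and the clock supplying
    'now' are synchronized.
    """
    return now - packet["created"] > max_age


packet = {"hops": 2, "created": 100.0, "payload": "hello"}
after_one_hop = forward(packet)            # hop counter is now 1
after_two_hops = forward(after_one_hop)    # counter reaches zero: discarded
```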


In practice, we will need to guarantee not only that a packet is dead, but also that all 
acknowledgements to it are also dead, so we will now introduce T, which is some small
multiple of the true maximum packet lifetime. The multiple is protocol dependent and simply
has the effect of making T longer. If we wait a time T after a packet has been sent, we can be
sure that all traces of it are now gone and that neither it nor its acknowledgements will 
suddenly appear out of the blue to complicate matters. 


With packet lifetimes bounded, it is possible to devise a foolproof way to establish connections 
safely. The method described below is due to Tomlinson (1975). It solves the problem but 
introduces some peculiarities of its own. The method was further refined by Sunshine and 
Dalal (1978). Variants of it are widely used in practice, including in TCP. 


To get around the problem of a machine losing all memory of where it was after a crash, 
Tomlinson proposed equipping each host with a time-of-day clock. The clocks at different hosts 
need not be synchronized. Each clock is assumed to take the form of a binary counter that 
increments itself at uniform intervals. Furthermore, the number of bits in the counter must 
equal or exceed the number of bits in the sequence numbers. Last, and most important, the 
clock is assumed to continue running even if the host goes down. 


The basic idea is to ensure that two identically numbered TPDUs are never outstanding at the 
same time. When a connection is set up, the low-order k bits of the clock are used as the initial 
sequence number (also k bits). Thus, unlike our protocols of Chap. 3, each connection starts 
numbering its TPDUs with a different initial sequence number. The sequence space should be 
so large that by the time sequence numbers wrap around, old TPDUs with the same sequence
number are long gone. This linear relation between time and initial sequence numbers is
shown in Fig. 6-10.


Figure 6-10. (a) TPDUs may not enter the forbidden region. (b) The 
resynchronization problem. 

Once both transport entities have agreed on the initial sequence number, any sliding window
protocol can be used for data flow control. In reality, the initial sequence number curve (shown 
by the heavy line) is not linear, but a staircase, since the clock advances in discrete steps. For 
simplicity we will ignore this detail. 


A problem occurs when a host crashes. When it comes up again, its transport entity does not 
know where it was in the sequence space. One solution is to require transport entities to be 
idle for T sec after a recovery to let all old TPDUs die off. However, in a complex internetwork,
T may be large, so this strategy is unattractive. 


To avoid requiring T sec of dead time after a crash, it is necessary to introduce a new 
restriction on the use of sequence numbers. We can best see the need for this restriction by 
means of an example. Let T, the maximum packet lifetime, be 60 sec and let the clock tick
once per second. As shown by the heavy line in Fig. 6-10(a), the initial sequence number for a 
connection opened at time x will be x. Imagine that at t = 30 sec, an ordinary data TPDU being 
sent on (a previously opened) connection 5 is given sequence number 80. Call this TPDU X. 
Immediately after sending TPDU X, the host crashes and then quickly restarts. At t = 60, it 
begins reopening connections 0 through 4. At t = 70, it reopens connection 5, using initial 
sequence number 70 as required. Within the next 15 sec it sends data TPDUs 70 through 80. 
Thus, at t = 85 a new TPDU with sequence number 80 and connection 5 has been injected into
the subnet. Unfortunately, TPDU X still exists. If it should arrive at the receiver before the new 
TPDU 80, TPDU X will be accepted and the correct TPDU 80 will be rejected as a duplicate. 


To prevent such problems, we must prevent sequence numbers from being used (i.e., assigned
to new TPDUs) for a time T before their potential use as initial sequence numbers. The illegal
combinations of time and sequence number are shown as the forbidden region in Fig. 6-10(a).
Before sending any TPDU on any connection, the transport entity must read the clock
and check to see that it is not in the forbidden region. 
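
The check can be sketched as follows, using the numbers from the example above (T =
60 sec, one clock tick per second). This is an illustration under stated assumptions
rather than Tomlinson's actual algorithm: the function names and the 8-bit sequence
number width are hypothetical, and we treat a sequence number as forbidden when it
lies from 1 to T ticks ahead of the value the clock would currently hand out as an
initial sequence number.

```python
def initial_sequence_number(clock, k_bits):
    """Low-order k bits of the clock, used as the initial sequence number."""
    return clock & ((1 << k_bits) - 1)


def in_forbidden_region(seq, clock, lifetime_ticks, k_bits):
    """True if seq may be handed out as an initial sequence number
    within the next lifetime_ticks clock ticks."""
    modulus = 1 << k_bits
    # Distance (modulo the sequence space) by which seq runs ahead of
    # the clock-derived initial sequence number.
    ahead = (seq - clock) % modulus
    return 0 < ahead <= lifetime_ticks


T = 60       # maximum packet lifetime, in clock ticks
K_BITS = 8   # illustrative width; real sequence spaces are far larger

# TPDU X from the example: sequence number 80 sent at t = 30.
tpdu_x_forbidden = in_forbidden_region(80, 30, T, K_BITS)
# Opening a connection at t = 70 with initial sequence number 70 is legal.
reopen_forbidden = in_forbidden_region(initial_sequence_number(70, K_BITS), 70, T, K_BITS)
```

With this check in place, TPDU X would not have been sent with sequence number 80 at
t = 30, so the collision described in the example could not arise.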


The protocol can get itself into trouble in two distinct ways. If a host sends too much data too 
fast on a newly-opened connection, the actual sequence number versus time curve may rise 
more steeply than the initial sequence number versus time curve. This means that the 
maximum data rate on any connection is one TPDU per clock tick. It also means that the 
transport entity must wait until the clock ticks before opening a new connection after a crash 
restart, lest the same number be used twice. Both of these points argue in favor of a short 
clock tick (a few μsec or less).


Unfortunately, entering the forbidden region from underneath by sending too fast is not the 
only way to get into trouble. From Fig. 6-10(b), we see that at any data rate less than the 
clock rate, the curve of actual sequence numbers used versus time will eventually run into the 
forbidden region from the left. The greater the slope of the actual sequence number curve, the 
longer this event will be delayed. As we stated above, just before sending every TPDU, the 
transport entity must check to see if it is about to enter the forbidden region, and if so, either 
delay the TPDU for T sec or resynchronize the sequence numbers.


The clock-based method solves the delayed duplicate problem for data TPDUs, but for this 
method to be useful, a connection must first be established. Since control TPDUs may also be 
delayed, there is a potential problem in getting both sides to agree on the initial sequence 
number. Suppose, for example, that connections are established by having host 1 send a 
CONNECTION REQUEST TPDU containing the proposed initial sequence number and destination 
port number to a remote peer, host 2. The receiver, host 2, then acknowledges this request by 
sending a CONNECTION ACCEPTED TPDU back. If the CONNECTION REQUEST TPDU is lost but 
a delayed duplicate CONNECTION REQUEST suddenly shows up at host 2, the connection will 
be established incorrectly. 


To solve this problem, Tomlinson (1975) introduced the three-way handshake. This 
establishment protocol does not require both sides to begin sending with the same sequence 
number, so it can be used with synchronization methods other than the global clock method. 
The normal setup procedure when host 1 initiates is shown in Fig. 6-11(a). Host 1 chooses a 
sequence number, x, and sends a CONNECTION REQUEST TPDU containing it to host 2. Host 2 
replies with an ACK TPDU acknowledging x and announcing its own initial sequence number, y. 
Finally, host 1 acknowledges host 2's choice of an initial sequence number in the first data 
TPDU that it sends. 


Figure 6-11. Three protocol scenarios for establishing a connection 
using a three-way handshake. CR denotes CONNECTION REQUEST. (a) 
Normal operation. (b) Old duplicate CONNECTION REQUEST appearing 
out of nowhere. (c) Duplicate CONNECTION REQUEST and duplicate ACK.

Now let us see how the three-way handshake works in the presence of delayed duplicate
control TPDUs. In Fig. 6-11(b), the first TPDU is a delayed duplicate CONNECTION REQUEST
from an old connection. This TPDU arrives at host 2 without host 1's knowledge. Host 2 reacts
to this TPDU by sending host 1 an ACK TPDU, in effect asking for verification that host 1 was
indeed trying to set up a new
connection. When host 1 rejects host 2's attempt to establish a connection, host 2 realizes that 
it was tricked by a delayed duplicate and abandons the connection. In this way, a delayed 
duplicate does no damage. 


The worst case is when both a delayed CONNECTION REQUEST and an ACK are floating around 
in the subnet. This case is shown in Fig. 6-11(c). As in the previous example, host 2 gets a 
delayed CONNECTION REQUEST and replies to it. At this point it is crucial to realize that host 2 
has proposed using y as the initial sequence number for host 2 to host 1 traffic, knowing full 
well that no TPDUs containing sequence number y or acknowledgements to y are still in 
existence. When the second delayed TPDU arrives at host 2, the fact that z has been 
acknowledged rather than y tells host 2 that this, too, is an old duplicate. The important thing 
to realize here is that there is no combination of old TPDUs that can cause the protocol to fail 
and have a connection set up by accident when no one wants it. 
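
The three scenarios of Fig. 6-11 can be traced with a small sketch. It is a
simplified model, assuming each TPDU is a dictionary with "type", "seq", and "ack"
fields; the function names and the concrete sequence numbers are hypothetical, and
timers and retransmissions are left out.

```python
def host2_on_request(x, y):
    """Host 2 answers any CONNECTION REQUEST(seq=x) with ACK(seq=y, ack=x)."""
    return {"type": "ACK", "seq": y, "ack": x}


def host1_on_ack(pending_x, ack_tpdu):
    """Host 1 confirms only if it really has a request with this x outstanding."""
    if pending_x is not None and ack_tpdu["ack"] == pending_x:
        return {"type": "DATA", "seq": pending_x, "ack": ack_tpdu["seq"]}
    return {"type": "REJECT", "ack": ack_tpdu["seq"]}


def host2_on_reply(y, reply):
    """Host 2 opens the connection only if its own fresh y was acknowledged."""
    return reply["type"] == "DATA" and reply["ack"] == y


# (a) Normal operation: host 1 has a request with x = 100 outstanding.
ack = host2_on_request(x=100, y=200)
normal = host2_on_reply(200, host1_on_ack(100, ack))

# (b) Old duplicate CONNECTION REQUEST: host 1 has nothing outstanding.
ack = host2_on_request(x=7, y=200)
old_request = host2_on_reply(200, host1_on_ack(None, ack))

# (c) Old duplicate request plus an old reply acknowledging z = 150, not y.
old_reply = {"type": "DATA", "seq": 7, "ack": 150}
old_both = host2_on_reply(200, old_reply)
```

Only scenario (a) ends with a connection being opened. In (b) host 1 rejects, and in
(c) the stale acknowledgement number gives the duplicate away.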


6.1.3 Connection Release 


Releasing a connection is easier than establishing one. Nevertheless, there are more pitfalls 
than one might expect. As we mentioned earlier, there are two styles of terminating a 
connection: asymmetric release and symmetric release. Asymmetric release is the way the 
telephone system works: when one party hangs up, the connection is broken. Symmetric
release treats the connection as two separate unidirectional connections and requires each one
to be released separately. 


Asymmetric release is abrupt and may result in data loss. Consider the scenario of Fig. 6-12. 
After the connection is established, host 1 sends a TPDU that arrives properly at host 2. Then 
host 1 sends another TPDU. Unfortunately, host 2 issues a DISCONNECT before the second 
TPDU arrives. The result is that the connection is released and data are lost. 


Figure 6-12. Abrupt disconnection with loss of data. 

Clearly, a more sophisticated release protocol is needed to avoid data loss. One way is to use
symmetric release, in which each direction is released independently of the other one. Here, a 
host can continue to receive data even after it has sent a DISCONNECT TPDU. 


Symmetric release does the job when each process has a fixed amount of data to send and 
clearly knows when it has sent it. In other situations, determining that all the work has been 
done and the connection should be terminated is not so obvious. One can envision a protocol 
in which host 1 says: "I am done. Are you done too?" If host 2 responds: "I am done too.
Goodbye," the connection can be safely released.


Unfortunately, this protocol does not always work. There is a famous problem that illustrates
this issue. It is called the two-army problem. Imagine that a white army is encamped in a 
valley, as shown in Fig. 6-13. On both of the surrounding hillsides are blue armies. The white 
army is larger than either of the blue armies alone, but together the blue armies are larger 
than the white army. If either blue army attacks by itself, it will be defeated, but if the two 
blue armies attack simultaneously, they will be victorious. 


Figure 6-13. The two-army problem. 

The blue armies want to synchronize their attacks. However, their only communication
medium is to send messengers on foot down into the valley, where they might be captured 
and the message lost (i.e., they have to use an unreliable communication channel). The 
question is: Does a protocol exist that allows the blue armies to win? 


Suppose that the commander of blue army #1 sends a message reading: "I propose we attack
at dawn on March 29. How about it?" Now suppose that the message arrives, the commander
of blue army #2 agrees, and his reply gets safely back to blue army #1. Will the attack
happen? Probably not, because commander #2 does not know if his reply got through. If it did
not, blue army #1 will not attack, so it would be foolish for him to charge into battle.


Now let us improve the protocol by making it a three-way handshake. The initiator of the 
original proposal must acknowledge the response. Assuming no messages are lost, blue army 
#2 will get the acknowledgement, but the commander of blue army #1 will now hesitate. After 
all, he does not know if his acknowledgement got through, and if it did not, he knows that blue 
army #2 will not attack. We could now make a four-way handshake protocol, but that does not 
help either. 


In fact, it can be proven that no protocol exists that works. Suppose that some protocol did 
exist. Either the last message of the protocol is essential or it is not. If it is not, remove it (and 
any other unessential messages) until we are left with a protocol in which every message is 
essential. What happens if the final message does not get through? We just said that it was 
essential, so if it is lost, the attack does not take place. Since the sender of the final message 
can never be sure of its arrival, he will not risk attacking. Worse yet, the other blue army 
knows this, so it will not attack either. 


To see the relevance of the two-army problem to releasing connections, just substitute 
"disconnect" for "attack." If neither side is prepared to disconnect until it is convinced that the 
other side is prepared to disconnect too, the disconnection will never happen. 


In practice, one is usually prepared to take more risks when releasing connections than when 
attacking white armies, so the situation is not entirely hopeless. Figure 6-14 illustrates four 
scenarios of releasing using a three-way handshake. While this protocol is not infallible, it is 
usually adequate. 


Figure 6-14. Four protocol scenarios for releasing a connection. (a) 
Normal case of three-way handshake. (b) Final ACK lost. (c) Response 
lost. (d) Response lost and subsequent DRs lost.

In Fig. 6-14(a), we see the normal case in which one of the users sends a DR
(DISCONNECTION REQUEST) TPDU to initiate the connection release. When it arrives, the 
recipient sends back a DR TPDU, too, and starts a timer, just in case its DR is lost. When this 
DR arrives, the original sender sends back an ACK TPDU and releases the connection. Finally, 
when the ACK TPDU arrives, the receiver also releases the connection. Releasing a connection 
means that the transport entity removes the information about the connection from its table of 
currently open connections and signals the connection's owner (the transport user) somehow. 
This action is different from a transport user issuing a DISCONNECT primitive. 


If the final ACK TPDU is lost, as shown in Fig. 6-14(b), the situation is saved by the timer.
When the timer expires, the connection is released anyway. 


Now consider the case of the second DR being lost. The user initiating the disconnection will 
not receive the expected response, will time out, and will start all over again. In Fig. 6-14(c) 
we see how this works, assuming that the second time no TPDUs are lost and all TPDUs are 
delivered correctly and on time. 


Our last scenario, Fig. 6-14(d), is the same as Fig. 6-14(c) except that now we assume all the 
repeated attempts to retransmit the DR also fail due to lost TPDUs. After N retries, the sender 
just gives up and releases the connection. Meanwhile, the receiver times out and also exits. 
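
The initiator's side of this protocol amounts to a bounded retry loop, which the
following sketch makes explicit. It assumes a hypothetical callable send_dr that
transmits one DR and returns True if the peer's DR comes back before the timer
expires; the function name release_initiator and the returned strings are likewise
made up for illustration.

```python
def release_initiator(send_dr, max_retries):
    """Run the initiator's side of the release.

    Returns (outcome, number of DRs sent). The initial DR is followed by
    at most max_retries retransmissions.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if send_dr():
            # The peer's DR arrived: send the final ACK and release.
            return ("released after reply", attempts)
    # Every attempt timed out: give up and release anyway.
    return ("released after giving up", attempts)


# Fig. 6-14(c): the first response is lost, the second attempt succeeds.
replies = iter([False, True])
case_c = release_initiator(lambda: next(replies), max_retries=3)

# Fig. 6-14(d): everything is lost, so the initiator gives up.
case_d = release_initiator(lambda: False, max_retries=3)
```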


While this protocol usually suffices, in theory it can fail if the initial DR and N retransmissions
are all lost. The sender will give up and release the connection, while the other side knows
nothing at all about the attempts to disconnect and is still fully active. This situation results in
a half-open connection.


We could have avoided this problem by not allowing the sender to give up after N retries but 
forcing it to go on forever until it gets a response. However, if the other side is allowed to time 
out, then the sender will indeed go on forever, because no response will ever be forthcoming. 
If we do not allow the receiving side to time out, then the protocol hangs in Fig. 6-14(d). 


One way to kill off half-open connections is to have a rule saying that if no TPDUs have arrived 
for a certain number of seconds, the connection is then automatically disconnected. That way, 
if one side ever disconnects, the other side will detect the lack of activity and also disconnect. 
Of course, if this rule is introduced, it is necessary for each transport entity to have a timer 
that is stopped and then restarted whenever a TPDU is sent. If this timer expires, a dummy 
TPDU is transmitted, just to keep the other side from disconnecting. On the other hand, if the 
automatic disconnect rule is used and too many dummy TPDUs in a row are lost on an
otherwise idle connection, first one side, then the other side will automatically disconnect. 
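
A minimal sketch of the two timers involved is given below. It assumes time is
supplied by the caller as a number of seconds, and the class name IdleMonitor, its
parameters, and the strings returned by poll are all hypothetical.

```python
class IdleMonitor:
    """Tracks activity on one connection for the automatic disconnect rule."""

    def __init__(self, keepalive_interval, disconnect_after, start=0):
        self.keepalive_interval = keepalive_interval
        self.disconnect_after = disconnect_after
        self.last_sent = start
        self.last_received = start

    def on_send(self, now):
        # Restart the send timer whenever any TPDU goes out.
        self.last_sent = now

    def on_receive(self, now):
        # Any arriving TPDU proves the other side is still alive.
        self.last_received = now

    def poll(self, now):
        if now - self.last_received >= self.disconnect_after:
            return "disconnect"
        if now - self.last_sent >= self.keepalive_interval:
            return "send dummy"
        return "idle"


monitor = IdleMonitor(keepalive_interval=5, disconnect_after=20)
first = monitor.poll(3)     # nothing to do yet
second = monitor.poll(5)    # send timer expired: transmit a dummy TPDU
monitor.on_send(5)
```

The keepalive interval should be chosen well below the disconnect threshold, so that
several dummy TPDUs in a row must be lost before the other side gives up.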


We will not belabor this point any more, but by now it should be clear that releasing a 
connection without data loss is not nearly as simple as it at first appears. 


6.1.4 Flow Control and Buffering 


Having examined connection establishment and release in some detail, let us now look at how 
connections are managed while they are in use. One of the key issues has come up before: 
flow control. In some ways the flow control problem in the transport layer is the same as in the 
data link layer, but in other ways it is different. The basic similarity is that in both layers a 
sliding window or other scheme is needed on each connection to keep a fast transmitter from 
overrunning a slow receiver. The main difference is that a router usually has relatively few 
lines, whereas a host may have numerous connections. This difference makes it impractical to 
implement the data link buffering strategy in the transport layer. 


In the data link protocols of Chap. 3, frames were buffered at both the sending router and at 
the receiving router. In protocol 6, for example, both sender and receiver are required to 
dedicate MAX_SEQ + 1 buffers to each line, half for input and half for output. For a host with a
maximum of, say, 64 connections, and a 4-bit sequence number, this protocol would require 
1024 buffers. 
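
The figure of 1024 follows directly from the numbers given, as this short
calculation shows (the variable names are ours):

```python
seq_bits = 4
connections = 64

max_seq = (1 << seq_bits) - 1          # largest 4-bit sequence number: 15
buffers_per_connection = max_seq + 1   # MAX_SEQ + 1 buffers per connection
total_buffers = connections * buffers_per_connection
```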


In the data link layer, the sending side must buffer outgoing frames because they might have 
to be retransmitted. If the subnet provides datagram service, the sending transport entity 
must also buffer, and for the same reason. If the receiver knows that the sender buffers all 
TPDUs until they are acknowledged, the receiver may or may not dedicate specific buffers to 
specific connections, as it sees fit. The receiver may, for example, maintain a single buffer pool 
shared by all connections. When a TPDU comes in, an attempt is made to dynamically acquire 
a new buffer. If one is available, the TPDU is accepted; otherwise, it is discarded. Since the 
sender is prepared to retransmit TPDUs lost by the subnet, no harm is done by having the 
receiver drop TPDUs, although some resources are wasted. The sender just keeps trying until 
it gets an acknowledgement. 
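
The shared pool described above might be sketched like this. The class BufferPool
and its methods are hypothetical, and the sketch only counts buffers rather than
storing any data.

```python
class BufferPool:
    """A single pool of buffers shared by all connections at the receiver."""

    def __init__(self, size):
        self.free = size

    def accept(self, tpdu):
        # Try to acquire a buffer for an incoming TPDU. If none is free the
        # TPDU is dropped; the sender will retransmit it later.
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        # Called once the data has been passed up to the transport user.
        self.free += 1


pool = BufferPool(size=2)
results = [pool.accept("m0"), pool.accept("m1"), pool.accept("m2")]
```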


In summary, if the network service is unreliable, the sender must buffer all TPDUs sent, just as 
in the data link layer. However, with reliable network service, other trade-offs become 
possible. In particular, if the sender knows that the receiver always has buffer space, it need 
not retain copies of the TPDUs it sends. However, if the receiver cannot guarantee that every 
incoming TPDU will be accepted, the sender will have to buffer anyway. In the latter case, the 
sender cannot trust the network layer's acknowledgement, because the acknowledgement 
means only that the TPDU arrived, not that it was accepted. We will come back to this 
important point later. 


Even if the receiver has agreed to do the buffering, there still remains the question of the 
buffer size. If most TPDUs are nearly the same size, it is natural to organize the buffers as a
pool of identically-sized buffers, with one TPDU per buffer, as in Fig. 6-15(a). However, if there 
is wide variation in TPDU size, from a few characters typed at a terminal to thousands of 
characters from file transfers, a pool of fixed-sized buffers presents problems. If the buffer size 
is chosen equal to the largest possible TPDU, space will be wasted whenever a short TPDU 
arrives. If the buffer size is chosen less than the maximum TPDU size, multiple buffers will be 
needed for long TPDUs, with the attendant complexity. 


Figure 6-15. (a) Chained fixed-size buffers. (b) Chained variable-sized 
buffers. (c) One large circular buffer per connection. 

Another approach to the buffer size problem is to use variable-sized buffers, as in Fig. 6-15(b).
The advantage here is better memory utilization, at the price of more complicated buffer 
management. A third possibility is to dedicate a single large circular buffer per connection, as 
in Fig. 6-15(c). This system also makes good use of memory, provided that all connections are 
heavily loaded, but is poor if some connections are lightly loaded. 


The optimum trade-off between source buffering and destination buffering depends on the 
type of traffic carried by the connection. For low-bandwidth bursty traffic, such as that 
produced by an interactive terminal, it is better not to dedicate any buffers, but rather to 
acquire them dynamically at both ends. Since the sender cannot be sure the receiver will be 
able to acquire a buffer, the sender must retain a copy of the TPDU until it is acknowledged. 
On the other hand, for file transfer and other high-bandwidth traffic, it is better if the receiver 
does dedicate a full window of buffers, to allow the data to flow at maximum speed. Thus, for 
low-bandwidth bursty traffic, it is better to buffer at the sender, and for high-bandwidth smooth
traffic, it is better to buffer at the receiver. 


As connections are opened and closed and as the traffic pattern changes, the sender and 
receiver need to dynamically adjust their buffer allocations. Consequently, the transport 
protocol should allow a sending host to request buffer space at the other end. Buffers could be 
allocated per connection, or collectively, for all the connections running between the two hosts. 
Alternatively, the receiver, knowing its buffer situation (but not knowing the offered traffic) 
could tell the sender "I have reserved X buffers for you." If the number of open connections 
should increase, it may be necessary for an allocation to be reduced, so the protocol should 
provide for this possibility. 


A reasonably general way to manage dynamic buffer allocation is to decouple the buffering 
from the acknowledgements, in contrast to the sliding window protocols of Chap. 3. Dynamic 
buffer management means, in effect, a variable-sized window. Initially, the sender requests a 
certain number of buffers, based on its perceived needs. The receiver then grants as many of 
these as it can afford. Every time the sender transmits a TPDU, it must decrement its
allocation, stopping altogether when the allocation reaches zero. The receiver then separately
piggybacks both acknowledgements and buffer allocations onto the reverse traffic. 


Figure 6-16 shows an example of how dynamic window management might work in a 
datagram subnet with 4-bit sequence numbers. Assume that buffer allocation information 
travels in separate TPDUs, as shown, and is not piggybacked onto reverse traffic. Initially, A 
wants eight buffers, but is granted only four of these. It then sends three TPDUs, of which the 
third is lost. The TPDU in line 6 acknowledges receipt of all TPDUs up to and including sequence number
1, thus allowing A to release those buffers, and furthermore informs A that it has permission to 
send three more TPDUs starting beyond 1 (i.e., TPDUs 2, 3, and 4). A knows that it has 
already sent number 2, so it thinks that it may send TPDUs 3 and 4, which it proceeds to do. 
At this point it is blocked and must wait for more buffer allocation. Timeout-induced
retransmissions (line 9), however, may occur while blocked, since they use buffers that have 
already been allocated. In line 10, B acknowledges receipt of all TPDUs up to and including 4 
but refuses to let A continue. Such a situation is impossible with the fixed window protocols of 
Chap. 3. The next TPDU from B to A allocates another buffer and allows A to continue. 
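
The sender's bookkeeping in this example can be sketched as follows. The sketch
makes two simplifying assumptions that are ours, not the protocol's: sequence
numbers are unbounded integers rather than 4-bit values, and "nothing acknowledged
yet" is written as ack = -1. The class name CreditSender and its methods are
hypothetical. A control TPDU <ack = a, buf = b> is taken to mean that A may send
sequence numbers up to and including a + b.

```python
class CreditSender:
    """Sender side of dynamic buffer allocation (new transmissions only)."""

    def __init__(self):
        self.next_seq = 0
        self.last_permitted = -1   # nothing may be sent before a grant arrives

    def on_control(self, ack, buf):
        # <ack, buf>: everything up to ack is acknowledged and buf more
        # TPDUs beyond it may be sent.
        self.last_permitted = ack + buf

    def can_send(self):
        return self.next_seq <= self.last_permitted

    def send(self):
        if not self.can_send():
            raise RuntimeError("blocked: no buffer allocation left")
        seq = self.next_seq
        self.next_seq += 1
        return seq


a = CreditSender()
a.on_control(ack=-1, buf=4)               # B grants four buffers: 0 through 3
sent = [a.send(), a.send(), a.send()]     # TPDUs 0, 1, 2 (2 is lost in transit)
a.on_control(ack=1, buf=3)                # permits 2 through 4
sent += [a.send(), a.send()]              # TPDUs 3 and 4
blocked = not a.can_send()                # A must now wait
```

Retransmissions, such as the one in line 9, do not go through send here because
they reuse buffers that have already been allocated.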


Figure 6-16. Dynamic buffer allocation. The arrows show the direction 
of transmission. An ellipsis (...) indicates a lost TPDU. 

 6  ←    <ack = 1, buf = 3>     ←   B acknowledges 0 and 1, permits 2-4
 7  →    <seq = 3, data = m3>   →   A has 1 buffer left
 8  →    <seq = 4, data = m4>   →   A has 0 buffers left, and must stop
 9  →    <seq = 2, data = m2>   →   A times out and retransmits
10  ←    <ack = 4, buf = 0>     ←   Everything acknowledged, but A still blocked
11  ←    <ack = 4, buf = 1>     ←   A may now send 5
12  ←    <ack = 4, buf = 2>     ←   B found a new buffer somewhere
13  →    <seq = 5, data = m5>   →   A has 1 buffer left
14  →    <seq = 6, data = m6>   →   A is now blocked again
15  ←    <ack = 6, buf = 0>     ←   A is still blocked
16  ...  <ack = 6, buf = 4>     ←   Potential deadlock


Potential problems with buffer allocation schemes of this kind can arise in datagram networks 
if control TPDUs can get lost. Look at line 16. B has now allocated more buffers to A, but the 
allocation TPDU was lost. Since control TPDUs are not sequenced or timed out, A is now 
deadlocked. To prevent this situation, each host should periodically send control TPDUs giving 
the acknowledgement and buffer status on each connection. That way, the deadlock will be 
broken, sooner or later. 


Until now we have tacitly assumed that the only limit imposed on the sender's data rate is the 
amount of buffer space available in the receiver. As memory prices continue to fall 
dramatically, it may become feasible to equip hosts with so much memory that lack of buffers 
is rarely, if ever, a problem. 


When buffer space no longer limits the maximum flow, another bottleneck will appear: the 
carrying capacity of the subnet. If adjacent routers can exchange at most x packets/sec and 
there are k disjoint paths between a pair of hosts, there is no way that those hosts can 
exchange more than kx TPDUs/sec, no matter how much buffer space is available at each end. 
If the sender pushes too hard (i.e., sends more than kx TPDUs/sec), the subnet will become 
congested because it will be unable to deliver TPDUs as fast as they are coming in. 


What is needed is a mechanism based on the subnet's carrying capacity rather than on the 
receiver's buffering capacity. Clearly, the flow control mechanism must be applied at the
sender to prevent it from having too many unacknowledged TPDUs outstanding at once. 
Belsnes (1975) proposed using a sliding window flow control scheme in which the sender 
dynamically adjusts the window size to match the network's carrying capacity. If the network 
can handle c TPDUs/sec and the cycle time (including transmission, propagation, queueing, 
processing at the receiver, and return of the acknowledgement) is r, then the sender's window 
should be cr. With a window of this size the sender normally operates with the pipeline full. 
Any small decrease in network performance will cause it to block. 


In order to adjust the window size periodically, the sender could monitor both parameters and 
then compute the desired window size. The carrying capacity can be determined by simply 
counting the number of TPDUs acknowledged during some time period and then dividing by 
the time period. During the measurement, the sender should send as fast as it can, to make 
sure that the network's carrying capacity, and not the low input rate, is the factor limiting the 
acknowledgement rate. The time required for a transmitted TPDU to be acknowledged can be 
measured exactly and a running mean maintained. Since the network capacity available to any 
given flow varies in time, the window size should be adjusted frequently, to track changes in 
the carrying capacity. As we will see later, the Internet uses a similar scheme. 
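
The measurement procedure just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name, the method names, and the smoothing weight of 0.875 are hypothetical and do not come from Belsnes' proposal. The sketch keeps a running mean of the cycle time r, estimates the capacity c from acknowledgements counted over a period, and suggests a window of cr.

```python
class WindowEstimator:
    """Tracks carrying capacity c and cycle time r, and suggests a window of c * r."""

    def __init__(self, smoothing=0.875):
        self.smoothing = smoothing    # weight given to the old mean (assumed value)
        self.mean_cycle_time = None   # running mean of r, in seconds
        self.capacity = None          # latest estimate of c, in TPDUs/sec

    def record_ack_delay(self, seconds):
        """Fold one measured send-to-acknowledgement delay into the running mean."""
        if self.mean_cycle_time is None:
            self.mean_cycle_time = seconds
        else:
            a = self.smoothing
            self.mean_cycle_time = a * self.mean_cycle_time + (1 - a) * seconds

    def record_interval(self, tpdus_acked, interval_seconds):
        """Capacity = TPDUs acknowledged during the period / length of the period."""
        self.capacity = tpdus_acked / interval_seconds

    def window(self):
        """Window size c * r, rounded down, but never below one TPDU."""
        return max(1, int(self.capacity * self.mean_cycle_time))


estimator = WindowEstimator()
estimator.record_interval(tpdus_acked=100, interval_seconds=2.0)  # c = 50 TPDUs/sec
estimator.record_ack_delay(0.25)                                  # r = 0.25 sec
first_window = estimator.window()                                 # 50 * 0.25 = 12.5 -> 12
```

Calling record_interval and record_ack_delay repeatedly lets the suggested window follow changes in the carrying capacity, which is the point of adjusting it frequently.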


6.1.5 Multiplexing 


Multiplexing several conversations onto connections, virtual circuits, and physical links plays a 
role in several layers of the network architecture. In the transport layer the need for 
multiplexing can arise in a number of ways. For example, if only one network address is 
available on a host, all transport connections on that machine have to use it. When a TPDU 
comes in, some way is needed to tell which process to give it to. This situation, called upward 
multiplexing, is shown in Fig. 6-17(a). In this figure, four distinct transport connections all 
use the same network connection (e.g., IP address) to the remote host. 


Figure 6-17. (a) Upward multiplexing. (b) Downward multiplexing. 


Multiplexing can also be useful in the transport layer for another reason. Suppose, for 
example, that a subnet uses virtual circuits internally and imposes a maximum data rate on 
each one. If a user needs more bandwidth than one virtual circuit can provide, a way out is to 
open multiple network connections and distribute the traffic among them on a round-robin 
basis, as indicated in Fig. 6-17(b). This modus operandi is called downward multiplexing. 
With k network connections open, the effective bandwidth is increased by a factor of k. A 
common example of downward multiplexing occurs with home users who have an ISDN line. 
This line provides for two separate connections of 64 kbps each. Using both of them to call an 
Internet provider and dividing the traffic over both lines makes it possible to achieve an 
effective bandwidth of 128 kbps. 
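
As a rough illustration of downward multiplexing, the following Python sketch deals TPDUs out to the open network connections in round-robin order. The function name and the channel labels B1 and B2 are our own, chosen to mirror the ISDN example; a real transport entity would also have to put the TPDUs back in order at the receiving end.

```python
from itertools import cycle


def distribute_round_robin(tpdus, connections):
    """Assign each TPDU to the next network connection in turn."""
    assignment = {conn: [] for conn in connections}
    for tpdu, conn in zip(tpdus, cycle(connections)):
        assignment[conn].append(tpdu)
    return assignment


# Two 64-kbps ISDN connections, as in the example above (labels are hypothetical).
channels = ["B1", "B2"]
plan = distribute_round_robin(range(6), channels)
effective_kbps = 64 * len(channels)   # k connections give k times the bandwidth
```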


6.1.6 Crash Recovery 


If hosts and routers are subject to crashes, recovery from these crashes becomes an issue. If 
the transport entity is entirely within the hosts, recovery from network and router crashes is 
straightforward. If the network layer provides datagram service, the transport entities expect 
lost TPDUs all the time and know how to cope with them. If the network layer provides 
connection-oriented service, then loss of a virtual circuit is handled by establishing a new one 
and then probing the remote transport entity to ask it which TPDUs it has received and which 
ones it has not received. The latter ones can be retransmitted. 


A more troublesome problem is how to recover from host crashes. In particular, it may be 
desirable for clients to be able to continue working when servers crash and then quickly 
reboot. To illustrate the difficulty, let us assume that one host, the client, is sending a long file 
to another host, the file server, using a simple stop-and-wait protocol. The transport layer on 
the server simply passes the incoming TPDUs to the transport user, one by one. Partway 
through the transmission, the server crashes. When it comes back up, its tables are 
reinitialized, so it no longer knows precisely where it was. 


In an attempt to recover its previous status, the server might send a broadcast TPDU to all 
other hosts, announcing that it had just crashed and requesting that its clients inform it of the 
status of all open connections. Each client can be in one of two states: one TPDU outstanding, 
S1, or no TPDUs outstanding, S0. Based on only this state information, the client must decide 
whether to retransmit the most recent TPDU. 


At first glance it would seem obvious: the client should retransmit if and only if it has an 
unacknowledged TPDU outstanding (i.e., is in state S1) when it learns of the crash. However, a 
closer inspection reveals difficulties with this naive approach. Consider, for example, the 
situation in which the server's transport entity first sends an acknowledgement, and then, 
when the acknowledgement has been sent, writes to the application process. Writing a TPDU 
onto the output stream and sending an acknowledgement are two distinct events that cannot 
be done simultaneously. If a crash occurs after the acknowledgement has been sent but before 
the write has been done, the client will receive the acknowledgement and thus be in state S0 
when the crash recovery announcement arrives. The client will therefore not retransmit, 
(incorrectly) thinking that the TPDU has arrived. This decision by the client leads to a missing 
TPDU. 


At this point you may be thinking: "That problem can be solved easily. All you have to do is 
reprogram the transport entity to first do the write and then send the acknowledgement." Try 
again. Imagine that the write has been done but the crash occurs before the acknowledgement 
can be sent. The client will be in state S1 and thus retransmit, leading to an undetected 
duplicate TPDU in the output stream to the server application process. 


No matter how the client and server are programmed, there are always situations where the 
protocol fails to recover properly. The server can be programmed in one of two ways: 
acknowledge first or write first. The client can be programmed in one of four ways: always 
retransmit the last TPDU, never retransmit the last TPDU, retransmit only in state S0, or 
retransmit only in state S1. This gives eight combinations, but as we shall see, for each 
combination there is some set of events that makes the protocol fail. 


Three events are possible at the server: sending an acknowledgement (A), writing to the 
output process (W), and crashing (C). The three events can occur in six different orderings: 
AC(W), AWC, C(AW), C(WA), WAC, and WC(A), where the parentheses are used to indicate 
that neither A nor W can follow C (i.e., once it has crashed, it has crashed). Figure 6-18 shows 
all eight combinations of client and server strategy and the valid event sequences for each 
one. Notice that for each strategy there is some sequence of events that causes the protocol to 
fail. For example, if the client always retransmits, the AWC sequence will generate an undetected 
duplicate, even though the other two sequences work properly. 
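
The eight combinations can be checked mechanically. The Python sketch below is our own model of the argument, not code from any protocol implementation. It assumes that an acknowledgement that was sent is also received, so the client is in S0 exactly when A completed before the crash, and that a retransmitted TPDU is written exactly once after the server comes back up. Each run is then classified by how many times the TPDU ends up being written: once (OK), twice (DUP), or never (LOST).

```python
# Events that complete before the crash, for each ordering named in the text.
ORDERINGS = {
    "ack first": {"AC(W)": "A", "AWC": "AW", "C(AW)": ""},
    "write first": {"C(WA)": "", "WAC": "WA", "WC(A)": "W"},
}

# Each client strategy maps the client's state (S0 or S1) to a retransmit decision.
CLIENT_STRATEGIES = {
    "always retransmit": lambda state: True,
    "never retransmit": lambda state: False,
    "retransmit in S0": lambda state: state == "S0",
    "retransmit in S1": lambda state: state == "S1",
}


def outcome(done_before_crash, should_retransmit):
    """Classify one run as OK, DUP (written twice), or LOST (never written)."""
    # Assumption: a sent acknowledgement always arrives, so the client is in
    # S0 exactly when event A completed before the crash.
    state = "S0" if "A" in done_before_crash else "S1"
    writes = int("W" in done_before_crash) + int(should_retransmit(state))
    return {0: "LOST", 1: "OK", 2: "DUP"}[writes]


def build_table():
    """Return the outcome of every ordering for all eight strategy pairs."""
    table = {}
    for client, rule in CLIENT_STRATEGIES.items():
        for server, orderings in ORDERINGS.items():
            table[(client, server)] = {
                name: outcome(done, rule) for name, done in orderings.items()
            }
    return table


if __name__ == "__main__":
    for (client, server), results in build_table().items():
        print(f"{client:18} | {server:11} | {results}")
```

Under these assumptions every one of the eight rows contains at least one DUP or LOST entry, which is the claim made above.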


Figure 6-18. Different combinations of client and server strategy. 


Making the protocol more elaborate does not help. Even if the client and server exchange 
several TPDUs before the server attempts to write, so that the client knows exactly what is 
about to happen, the client has no way of knowing whether a crash occurred just before or just 
after the write. The conclusion is inescapable: under our ground rules of no simultaneous 
events, host crash and recovery cannot be made transparent to higher layers. 


Put in more general terms, this result can be restated as follows: recovery from a layer N crash 
can only be done by layer N + 1, and then only if the higher layer retains enough status 
information. As mentioned above, the transport layer can recover from failures in the network 
layer, provided that each end of a connection keeps track of where it is. 


This problem gets us into the issue of what a so-called end-to-end acknowledgement really means. 
In principle, the transport protocol is end-to-end and not chained like the lower layers. Now 
consider the case of a user entering requests for transactions against a remote database. Suppose that 
the remote transport entity is programmed to first pass TPDUs to the next layer up and then 
acknowledge. Even in this case, the receipt of an acknowledgement back at the user's machine does 
not necessarily mean that the remote host stayed up long enough to actually update the database. A 
truly end-to-end acknowledgement, whose receipt means that the work has actually been done and 
lack thereof means that it has not, is probably impossible to achieve. This point is discussed in more 
detail by Saltzer et al. (1984). 


6.2 The Internet Transport Protocols: UDP 


The Internet has two main protocols in the transport layer, a connectionless protocol and a 
connection-oriented one. In the following sections we will study both of them. The 
connectionless protocol is UDP. The connection-oriented protocol is TCP. Because UDP is 
basically just IP with a short header added, we will start with it. We will also look at two 
applications of UDP. 


6.2.1 Introduction to UDP 


The Internet protocol suite supports a connectionless transport protocol, UDP (User 
Datagram Protocol). UDP provides a way for applications to send encapsulated IP datagrams 
without having to establish a connection. UDP is described in RFC 768. 


UDP transmits segments consisting of an 8-byte header followed by the payload. The header 
is shown in Fig. 6-23. The two ports serve to identify the end points within the source and 
destination machines. When a UDP packet arrives, its payload is handed to the process 
attached to the destination port. This attachment occurs when the BIND primitive or something 
similar is used, as we saw in Fig. 6-6 for TCP (the binding process is the same for UDP). In 
fact, the main value of having UDP over just using raw IP is the addition of the source and 
destination ports. Without the port fields, the transport layer would not know what to do with 
the packet. With them, it delivers segments correctly. 


Figure 6-23. The UDP header. 


|<------------- 32 Bits ------------->| 
| Source port      | Destination port | 
| UDP length       | UDP checksum     | 


The source port is primarily needed when a reply must be sent back to the source. By copying 
the source port field from the incoming segment into the destination port field of the outgoing 
segment, the process sending the reply can specify which process on the sending machine is to 
get it. 


The UDP length field includes the 8-byte header and the data. The UDP checksum is optional 
and stored as 0 if not computed (a true computed 0 is stored as all 1s). Turning it off is foolish 
unless the quality of the data does not matter (e.g., digitized speech). 
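
To make the header layout concrete, here is a small Python sketch that builds and parses the four 16-bit fields with the standard struct module. The function names are ours. The checksum is simply left at 0 ("not computed"), because a real checksum calculation also covers a pseudoheader that is outside the scope of this sketch. The last function shows the port swap used when sending a reply.

```python
import struct

# Four 16-bit fields in network byte order:
# source port, destination port, UDP length, UDP checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_segment(src_port, dst_port, payload, checksum=0):
    """Prefix the payload with an 8-byte UDP header; checksum 0 means 'not computed'."""
    length = UDP_HEADER.size + len(payload)   # the length covers header plus data
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_segment(segment):
    """Return (source port, destination port, length, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(segment)
    return src_port, dst_port, length, checksum, segment[UDP_HEADER.size:length]


def build_reply(request_segment, payload):
    """Swap the ports so the reply reaches the process that sent the request."""
    src_port, dst_port, _, _, _ = parse_segment(request_segment)
    return build_segment(dst_port, src_port, payload)


request = build_segment(5000, 53, b"query")   # the port numbers are arbitrary examples
reply = build_reply(request, b"answer")
```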


It is probably worth mentioning explicitly some of the things that UDP does not do. It does not 
do flow control, error control, or retransmission upon receipt of a bad segment. All of that is up 
to the user processes. What it does do is provide an interface to the IP protocol with the added 
feature of demultiplexing multiple processes using the ports. That is all it does. For 
applications that need to have precise control over the packet flow, error control, or timing, 
UDP provides just what the doctor ordered. 


One area where UDP is especially useful is in client-server situations. Often, the client sends a 
short request to the server and expects a short reply back. If either the request or reply is lost, 
the client can just time out and try again. Not only is the code simple, but fewer messages are 
required (one in each direction) than with a protocol requiring an initial setup. 
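
The time-out-and-try-again pattern might look as follows in Python, using the standard socket module. Everything here is a sketch under our own assumptions: the function names, the timeout of 0.5 sec, and the limit of three tries are arbitrary, and the "lossy" server exists only so that the retry path can be exercised on the local machine.

```python
import socket
import threading


def udp_request(server_addr, request, timeout=0.5, max_tries=3):
    """Send one datagram and wait for one reply, retrying when the wait times out."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for attempt in range(1, max_tries + 1):
            sock.sendto(request, server_addr)
            try:
                reply, _ = sock.recvfrom(2048)
                return reply, attempt
            except socket.timeout:
                continue   # request or reply was lost: just try again
    raise TimeoutError("no reply after %d tries" % max_tries)


def start_lossy_echo_server(drop_first=1):
    """Hypothetical test server: silently ignores the first `drop_first` requests."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))   # port 0 lets the system pick a free port

    def serve():
        dropped = 0
        while True:
            data, client = server.recvfrom(2048)
            if dropped < drop_first:
                dropped += 1        # simulate a lost request
                continue
            server.sendto(b"reply:" + data, client)

    threading.Thread(target=serve, daemon=True).start()
    return server.getsockname()


if __name__ == "__main__":
    addr = start_lossy_echo_server(drop_first=1)
    print(udp_request(addr, b"hello", timeout=0.2))
```

Note that no connection is set up or released; when nothing is lost, exactly two datagrams cross the network.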


An application that uses UDP this way is DNS (the Domain Name System), which we will study 
in Chap. 7. In brief, a program that needs to look up the IP address of some host name, for 
example, www.cs.berkeley.edu, can send a UDP packet containing the host name to a DNS 
server. The server replies with a UDP packet containing the host's IP address. No setup is 
needed in advance and no release is needed afterward. Just two messages go over the 
network. 


6.2.2 Remote Procedure Call 


In a certain sense, sending a message to a remote host and getting a reply back is a lot like 
making a function call in a programming language. In both cases you start with one or more 
parameters and you get back a result. This observation has led people to try to arrange 
request-reply interactions on networks to be cast in the form of procedure calls. Such an 
arrangement makes network applications much easier to program and more familiar to deal 
with. For example, just imagine a procedure named get_IP_address(host_name) that works 
by sending a UDP packet to a DNS server and waiting for the reply, timing out and trying again 
if one is not forthcoming quickly enough. In this way, all the details of networking can be 
hidden from the programmer. 


The key work in this area was done by Birrell and Nelson (1984). In a nutshell, what Birrell 
and Nelson suggested was allowing programs to call procedures located on remote hosts. 
When a process on machine 1 calls a procedure on machine 2, the calling process on 1 is 
suspended and execution of the called procedure takes place on 2. Information can be 
transported from the caller to the callee in the parameters and can come back in the procedure 
result. No message passing is visible to the programmer. This technique is known as RPC 
(Remote Procedure Call) and has become the basis for many networking applications. 
Traditionally, the calling procedure is known as the client and the called procedure is known as 
the server, and we will use those names here too. 


The idea behind RPC is to make a remote procedure call look as much as possible like a local 
one. In the simplest form, to call a remote procedure, the client program must be bound with 
a small library procedure, called the client stub, that represents the server procedure in the 
client's address space. Similarly, the server is bound with a procedure called the server stub. 
These procedures hide the fact that the procedure call from the client to the server is not local. 


The actual steps in making an RPC are shown in Fig. 6-24. Step 1 is the client calling the client 
stub. This call is a local procedure call, with the parameters pushed onto the stack in the 
normal way. Step 2 is the client stub packing the parameters into a message and making a 
system call to send the message. Packing the parameters is called marshaling. Step 3 is the 
kernel sending the message from the client machine to the server machine. Step 4 is the 
kernel passing the incoming packet to the server stub. Finally, step 5 is the server stub calling 
the server procedure with the unmarshaled parameters. The reply traces the same path in the 
other direction. 
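
The five steps can be imitated inside a single Python program. This is a toy model under stated assumptions: JSON stands in for the marshaling format, the function kernel_send stands in for both kernels and the network, and all names are ours. In a real system the client stub would carry the same name as the server procedure, since the two live in different address spaces; here they share one file, so the stub is called client_stub_add.

```python
import json


def add(a, b):
    """The server procedure; it knows nothing about the network."""
    return a + b


def server_stub(message):
    """Steps 4 and 5: unmarshal the parameters and call the server procedure."""
    request = json.loads(message.decode())
    procedures = {"add": add}
    result = procedures[request["proc"]](*request["params"])
    return json.dumps({"result": result}).encode()   # marshal the reply


def kernel_send(message):
    """Step 3, simulated: carries the message to the server and the reply back."""
    return server_stub(message)


def client_stub_add(a, b):
    """Step 2: marshal the parameters into a message and hand it to the kernel."""
    message = json.dumps({"proc": "add", "params": [a, b]}).encode()
    reply = kernel_send(message)
    return json.loads(reply.decode())["result"]


# Step 1: the client makes what looks like an ordinary local procedure call.
total = client_stub_add(2, 3)
```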


Figure 6-24. Steps in making a remote procedure call. The stubs are 
shaded. 


The key item to note here is that the client procedure, written by the user, just makes a 
normal (i.e., local) procedure call to the client stub, which has the same name as the server 
procedure. Since the client procedure and client stub are in the same address space, the 
parameters are passed in the usual way. Similarly, the server procedure is called by a 
procedure in its address space with the parameters it expects. To the server procedure, 
nothing is unusual. In this way, instead of I/O being done on sockets, network communication 
is done by faking a normal procedure call. 


Despite the conceptual elegance of RPC, there are a few snakes hiding under the grass. A big 
one is the use of pointer parameters. Normally, passing a pointer to a procedure is not a 
problem. The called procedure can use the pointer in the same way the caller can because 
both procedures live in the same virtual address space. With RPC, passing pointers is 
impossible because the client and server are in different address spaces. 


In some cases, tricks can be used to make it possible to pass pointers. Suppose that the first 
parameter is a pointer to an integer, k. The client stub can marshal k and send it along to the 
server. The server stub then creates a pointer to k and passes it to the server procedure, just 
as it expects. When the server procedure returns control to the server stub, the latter sends k 
back to the client where the new k is copied over the old one, just in case the server changed 
it. In effect, the standard calling sequence of call-by-reference has been replaced by copy- 
restore. Unfortunately, this trick does not always work, for example, if the pointer points to a 
graph or other complex data structure. For this reason, some restrictions must be placed on 
parameters to procedures called remotely. 
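
Copy-restore can be sketched as follows. Python has no pointers, so we assume a one-element list as a stand-in for a pointer to the integer k; the function names are ours and the message passing between the two stubs is reduced to a direct call.

```python
def increment(ref):
    """Server procedure: expects a 'pointer' and modifies what it points to."""
    ref[0] += 1


def server_stub(value):
    """Create a local pointer to the received copy of k, then call the procedure."""
    ref = [value]
    increment(ref)
    return ref[0]          # send the (possibly changed) k back


def client_stub_increment(ref):
    """Marshal k itself rather than the pointer, then restore the result."""
    new_value = server_stub(ref[0])   # copy: only the value travels
    ref[0] = new_value                # restore: the new k is copied over the old one


k = [41]
client_stub_increment(k)   # k[0] is now 42
```

The sketch works because k is a single integer. For a pointer into a graph, the stub would have no simple way of knowing how much of the structure to copy.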


A second problem is that in weakly-typed languages, like C, it is perfectly legal to write a 
procedure that computes the inner product of two vectors (arrays), without specifying how 
large either one is. Each could be terminated by a special value known only to the calling and 
called procedure. Under these circumstances, it is essentially impossible for the client stub to 
marshal the parameters: it has no way of determining how large they are. 


A third problem is that it is not always possible to deduce the types of the parameters, not 
even from a formal specification or the code itself. An example is printf, which may have any 
number of parameters (at least one), and the parameters can be an arbitrary mixture of 
integers, shorts, longs, characters, strings, floating-point numbers of various lengths, and 
other types. Trying to call printf as a remote procedure would be practically impossible 
because C is so permissive. However, a rule saying that RPC can be used provided that you do 
not program in C (or C++) would not be popular. 


A fourth problem relates to the use of global variables. Normally, the calling and called 
procedure can communicate by using global variables, in addition to communicating via 
parameters. If the called procedure is now moved to a remote machine, the code will fail 
because the global variables are no longer shared. 


These problems are not meant to suggest that RPC is hopeless. In fact, it is widely used, but 
some restrictions are needed to make it work well in practice. 


Of course, RPC need not use UDP packets, but RPC and UDP are a good fit and UDP is 
commonly used for RPC. However, when the parameters or results may be larger than the 
maximum UDP packet or when the operation requested is not idempotent (i.e., cannot be 
repeated safely, such as when incrementing a counter), it may be necessary to set up a TCP 
connection and send the request over it rather than use UDP. 
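
The difference between idempotent and non-idempotent operations shows up as soon as a reply is lost and the client repeats the request. The following Python sketch simulates that case; the class and function names are hypothetical.

```python
class Server:
    def __init__(self):
        self.counter = 0
        self.records = {}

    def increment(self):
        """Not idempotent: every execution changes the result."""
        self.counter += 1
        return self.counter

    def store(self, key, value):
        """Idempotent: executing it twice leaves the same state as once."""
        self.records[key] = value
        return value


def call_with_lost_reply(operation, *args):
    """The request is executed, its reply is lost, and the client sends it again."""
    operation(*args)           # first execution; the reply never arrives
    return operation(*args)    # the retransmitted request is executed again


server = Server()
call_with_lost_reply(server.store, "x", 7)   # harmless: records == {"x": 7}
call_with_lost_reply(server.increment)       # harmful: counter is 2, not 1
```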


6.2.3 The Real-Time Transport Protocol 


Client-server RPC is one area in which UDP is widely used. Another one is real-time multimedia 
applications. In particular, as Internet radio, Internet telephony, music-on-demand, 
videoconferencing, video-on-demand, and other multimedia applications became more 
commonplace, people discovered that each application was reinventing more or less the same 
real-time transport protocol. It gradually became clear that having a generic real-time 
transport protocol for multiple applications would be a good idea. Thus was RTP (Real-time 
Transport Protocol) born. It is described in RFC 1889 and is now in widespread use. 


The position of RTP in the protocol stack is somewhat strange. It was decided to put RTP in 
user space and have it (normally) run over UDP. It operates as follows. The multimedia 
application consists of multiple audio, video, text, and possibly other streams. These are fed 
into the RTP library, which is in user space along with the application. This library then 
multiplexes the streams and encodes them in RTP packets, which it then stuffs into a socket. 
At the other end of the socket (in the operating system kernel), UDP packets are generated 
and embedded in IP packets. If the computer is on an Ethernet, the IP packets are then put in 
Ethernet frames for transmission. The protocol stack for this situation is shown in Fig. 6-25(a). 
The packet nesting is shown in Fig. 6-25(b). 


Figure 6-25. (a) The position of RTP in the protocol stack. (b) Packet 
nesting. 


As a consequence of this design, it is a little hard to say which layer RTP is in. Since it runs in 
user space and is linked to the application program, it certainly looks like an application 
protocol. On the other hand, it is a generic, application-independent protocol that just provides 
transport facilities, so it also looks like a transport protocol. Probably the best description is 
that it is a transport protocol that is implemented in the application layer. 


The basic function of RTP is to multiplex several real-time data streams onto a single stream of 
UDP packets. The UDP stream can be sent to a single destination (unicasting) or to multiple 
destinations (multicasting). Because RTP just uses normal UDP, its packets are not treated 
specially by the routers unless some normal IP quality-of-service features are enabled. In 
particular, there are no special guarantees about delivery, jitter, etc. 


Each packet sent in an RTP stream is given a number one higher than its predecessor. This 
numbering allows the destination to determine if any packets are missing. If a packet is 
missing, the best action for the destination to take is to approximate the missing value by 
interpolation. Retransmission is not a practical option since the retransmitted packet would 
probably arrive too late to be useful. As a consequence, RTP has no flow control, no error 
control, no acknowledgements, and no mechanism to request retransmissions. 
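
A receiver's handling of gaps might be sketched like this in Python. For simplicity we assume one numeric sample per packet, at least one received packet, no wraparound of the sequence number, and linear interpolation between the nearest packets that did arrive; the function names are ours.

```python
def missing_numbers(received):
    """received maps sequence number -> sample; list the numbers that never arrived."""
    seqs = sorted(received)
    return [n for n in range(seqs[0], seqs[-1] + 1) if n not in received]


def fill_missing(received):
    """Return a gap-free list of samples, interpolating linearly across each gap."""
    seqs = sorted(received)
    filled = []
    for prev, nxt in zip(seqs, seqs[1:]):
        filled.append(received[prev])
        gap = nxt - prev
        for step in range(1, gap):   # runs only when packets are missing
            fraction = step / gap
            filled.append(received[prev] + fraction * (received[nxt] - received[prev]))
    filled.append(received[seqs[-1]])
    return filled


received = {10: 100.0, 11: 110.0, 13: 130.0, 14: 140.0}   # packet 12 was lost
lost = missing_numbers(received)
samples = fill_missing(received)
```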


Each RTP payload may contain multiple samples, and they may be coded any way that the 
application wants. To allow for interworking, RTP defines several profiles (e.g., a single audio 
stream), and for each profile, multiple encoding formats may be allowed. For example, a single 
audio stream may be encoded as 8-bit PCM samples at 8 kHz, delta encoding, predictive 
encoding, GSM encoding, MP3, and so on. RTP provides a header field in which the source can 
specify the encoding but is otherwise not involved in how encoding is done. 


Another facility many real-time applications need is timestamping. The idea here is to allow the 
source to associate a timestamp with the first sample in each packet. The timestamps are 
relative to the start of the stream, so only the differences between timestamps are significant. 
The absolute values have no meaning. This mechanism allows the destination to do a small 
amount of buffering and play each sample the right number of milliseconds after the start of 
the stream, independently of when the packet containing the sample arrived. Not only does 
timestamping reduce the effects of jitter, but it also allows multiple streams to be 
synchronized with each other. For example, a digital television program might have a video 
stream and two audio streams. The two audio streams could be for stereo broadcasts or for 
handling films with an original language soundtrack and a soundtrack dubbed into the local 
language, giving the viewer a choice. Each stream comes from a different physical device, but 
if they are timestamped from a single counter, they can be played back synchronously, even if 
the streams are transmitted somewhat erratically. 
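
The buffering idea can be expressed as a small playout schedule. In this Python sketch we assume, for readability, that timestamps and arrival times are both in milliseconds and that the first packet in the list is the first one of the stream; the function name and the 50-msec playout delay are our own choices. Each sample is played at a fixed offset from the start of the stream, whatever its arrival time, and a packet that arrives after its slot is marked as too late.

```python
def playout_schedule(packets, playout_delay_ms):
    """packets: list of (arrival_ms, timestamp_ms) pairs in sending order.

    Returns (timestamp, play_at) pairs; play_at is None when the packet
    arrived after the moment at which it should have been played.
    """
    first_arrival, first_timestamp = packets[0]
    base = first_arrival + playout_delay_ms   # when the start of the stream is played
    schedule = []
    for arrival, timestamp in packets:
        # Only the difference between timestamps matters, not their absolute value.
        play_at = base + (timestamp - first_timestamp)
        schedule.append((timestamp, play_at if arrival <= play_at else None))
    return schedule


# Packets sent 20 msec apart but arriving with jitter.
packets = [(1000, 0), (1035, 20), (1041, 40), (1130, 60)]
schedule = playout_schedule(packets, playout_delay_ms=50)
```

The first three packets are played exactly 20 msec apart despite their irregular arrival; the fourth misses its slot. A larger playout delay tolerates more jitter at the cost of more latency.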


The RTP header is illustrated in Fig. 6-26. It consists of three 32-bit words and potentially 
some extensions. The first word contains the Version field, which is already at 2. Let us hope 
this version is very close to the ultimate version since there is only one code point left 
(although 3 could be defined as meaning that the real version was in an extension word). 


Figure 6-26. The RTP header. 


The P bit indicates that the packet has been padded to a multiple of 4 bytes. The last padding 
byte tells how many bytes were added. The X bit indicates that an extension header is present. 
The format and meaning of the extension header are not defined. The only thing that is 
defined is that the first word of the extension gives the length. This is an escape hatch for any 
unforeseen requirements. 


The CC field tells how many contributing sources are present, from 0 to 15 (see below). The M 
bit is an application-specific marker bit. It can be used to mark the start of a video frame, the 
start of a word in an audio channel, or something else that the application understands. The 
Payload type field tells which encoding algorithm has been used (e.g., uncompressed 8-bit 
audio, MP3, etc.). Since every packet carries this field, the encoding can change during 
transmission. The Sequence number is just a counter that is incremented on each RTP packet 
sent. It is used to detect lost packets. 


The timestamp is produced by the stream's source to note when the first sample in the packet 
was made. This value can help reduce jitter at the receiver by decoupling the playback from 
the packet arrival time. The Synchronization source identifier tells which stream the packet 
belongs to. It is the method used to multiplex and demultiplex multiple data streams onto a 
single stream of UDP packets. Finally, the Contributing source identifiers, if any, are used when 
mixers are present in the studio. In that case, the mixer is the synchronizing source, and the 
streams being mixed are listed here. 
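
The fields just described can be packed and unpacked with Python's struct module. The sketch below assumes the field widths of the RFC 1889 layout (2-bit Version, 1-bit P, 1-bit X, 4-bit CC, 1-bit M, 7-bit Payload type, 16-bit Sequence number, 32-bit timestamp, 32-bit Synchronization source identifier); the function names and the example values are ours, and extension headers and padding bytes are not handled.

```python
import struct


def pack_rtp_header(payload_type, seq, timestamp, ssrc, csrcs=(),
                    marker=False, padding=False, extension=False, version=2):
    """Build the three fixed 32-bit words plus one word per contributing source."""
    byte0 = (version << 6) | (int(padding) << 5) | (int(extension) << 4) | len(csrcs)
    byte1 = (int(marker) << 7) | payload_type
    header = struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def unpack_rtp_header(data):
    """Split an RTP header back into its fields."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack_from("!BBHII", data)
    cc = byte0 & 0x0F   # number of contributing sources, 0 to 15
    csrcs = [struct.unpack_from("!I", data, 12 + 4 * i)[0] for i in range(cc)]
    return {
        "version": byte0 >> 6,
        "padding": bool(byte0 & 0x20),
        "extension": bool(byte0 & 0x10),
        "cc": cc,
        "marker": bool(byte1 & 0x80),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }


# Example values are arbitrary; two contributing sources make the header 20 bytes long.
header = pack_rtp_header(payload_type=14, seq=7, timestamp=160,
                         ssrc=0x1234, csrcs=(1, 2), marker=True)
fields = unpack_rtp_header(header)
```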


RTP has a little sister protocol (little sibling protocol?) called RTCP (Realtime Transport 
Control Protocol). It handles feedback, synchronization, and the user interface but does not 
transport any data. The first function can be used to provide feedback on delay, jitter, 
bandwidth, congestion, and other network properties to the sources. This information can be 
used by the encoding process to increase the data rate (and give better quality) when the 
network is functioning well and to cut back the data rate when there is trouble in the network. 
By providing continuous feedback, the encoding algorithms can be continuously adapted to 
provide the best quality possible under the current circumstances. For example, if the 
bandwidth increases or decreases during the transmission, the encoding may switch from MP3 
to 8-bit PCM to delta encoding as required. The Payload type field is used to tell the destination 
what encoding algorithm is used for the current packet, making it possible to vary it on 
demand. 


RTCP also handles interstream synchronization. The problem is that different streams may use 
different clocks, with different granularities and different drift rates. RTCP can be used to keep 
them in sync. 


Finally, RTCP provides a way for naming the various sources (e.g., in ASCII text). This 
information can be displayed on the receiver's screen to indicate who is talking at the moment. 


More information about RTP can be found in (Perkins, 2002). 


6.3 The Internet Transport Protocols: TCP 


UDP is a simple protocol and it has some niche uses, such as client-server interactions and 
multimedia, but for most Internet applications, reliable, sequenced delivery is needed. UDP 
cannot provide this, so another protocol is required. It is called TCP and is the main workhorse 
of the Internet. Let us now study it in detail. 


6.3.1 Introduction to TCP 


TCP (Transmission Control Protocol) was specifically designed to provide a reliable end-to- 
end byte stream over an unreliable internetwork. An internetwork differs from a single network 
because different parts may have wildly different topologies, bandwidths, delays, packet sizes, 
and other parameters. TCP was designed to dynamically adapt to properties of the 
internetwork and to be robust in the face of many kinds of failures. 


TCP was formally defined in RFC 793. As time went on, various errors and inconsistencies were 
detected, and the requirements were changed in some areas. These clarifications and some 
bug fixes are detailed in RFC 1122. Extensions are given in RFC 1323. 


Each machine supporting TCP has a TCP transport entity, either a library procedure, a user 
process, or part of the kernel. In all cases, it manages TCP streams and interfaces to the IP 
layer. A TCP entity accepts user data streams from local processes, breaks them up into pieces 
not exceeding 64 KB (in practice, often 1460 data bytes in order to fit in a single Ethernet 
frame with the IP and TCP headers), and sends each piece as a separate IP datagram. When 
datagrams containing TCP data arrive at a machine, they are given to the TCP entity, which 
reconstructs the original byte streams. For simplicity, we will sometimes use just "TCP" to 
mean the TCP transport entity (a piece of software) or the TCP protocol (a set of rules). From 
the context it will be clear which is meant. For example, in "The user gives TCP the data," the 
TCP transport entity is clearly intended. 


The IP layer gives no guarantee that datagrams will be delivered properly, so it is up to TCP to 
time out and retransmit them as need be. Datagrams that do arrive may well do so in the 
wrong order; it is also up to TCP to reassemble them into messages in the proper sequence. In 
short, TCP must furnish the reliability that most users want and that IP does not provide. 


6.3.2 The TCP Service Model 


TCP service is obtained by both the sender and receiver creating end points, called sockets, as 
discussed in Sec. 6.1.3. Each socket has a socket number (address) consisting of the IP 
address of the host and a 16-bit number local to that host, called a port. A port is the TCP 
name for a TSAP. For TCP service to be obtained, a connection must be explicitly established 
between a socket on the sending machine and a socket on the receiving machine. The socket 
calls are listed in Fig. 6-5. 


A socket may be used for multiple connections at the same time. In other words, two or more 
connections may terminate at the same socket. Connections are identified by the socket 
identifiers at both ends, that is, (socket1, socket2). No virtual circuit numbers or other 
identifiers are used. 
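
One way to picture this is a table keyed by the pair of socket identifiers. The Python sketch below is purely illustrative: the addresses are taken from ranges reserved for documentation, the port numbers are arbitrary, and the function names are ours. Two connections terminate at the same server socket yet remain distinct because the other end differs.

```python
connections = {}   # (local socket, remote socket) -> data received so far


def open_connection(local, remote):
    """Register a connection under the pair of socket identifiers."""
    connections[(local, remote)] = []
    return (local, remote)


def deliver(local, remote, data):
    """Hand incoming data to the connection named by (local, remote)."""
    connections[(local, remote)].append(data)


server = ("192.0.2.10", 80)              # one socket: (IP address, port)
client_a = ("198.51.100.7", 40001)
client_b = ("198.51.100.8", 40001)

open_connection(server, client_a)
open_connection(server, client_b)
deliver(server, client_a, b"request from A")
deliver(server, client_b, b"request from B")
```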


Port numbers below 1024 are called well-known ports and are reserved for standard 
services. For example, any process wishing to establish a connection to a host to transfer a file 
using FTP can connect to the destination host's port 21 to contact its FTP daemon. The list of 
well-known ports is given at www.iana.org. Over 300 have been assigned. A few of the better 
known ones are listed in Fig. 6-27. 


Figure 6-27. Some assigned ports. 


Port | Protocol | Use 
21 | FTP | File transfer 
23 | Telnet | Remote login 
25 | SMTP | E-mail 
69 | TFTP | Trivial file transfer protocol 
79 | Finger | Lookup information about a user 
80 | HTTP | World Wide Web 
110 | POP-3 | Remote e-mail access 
119 | NNTP | USENET news 


It would certainly be possible to have the FTP daemon attach itself to port 21 at boot time, the 
telnet daemon to attach itself to port 23 at boot time, and so on. However, doing so would 
clutter up memory with daemons that were idle most of the time. Instead, what is generally 
done is to have a single daemon, called inetd (Internet daemon) in UNIX, attach itself to 
multiple ports and wait for the first incoming connection. When that occurs, inetd forks off a 
new process and executes the appropriate daemon in it, letting that daemon handle the 
request. In this way, the daemons other than inetd are only active when there is work for 
them to do. Inetd learns which ports it is to use from a configuration file. Consequently, the 
system administrator can set up the system to have permanent daemons on the busiest ports 
(e.g., port 80) and inetd on the rest. 


All TCP connections are full duplex and point-to-point. Full duplex means that traffic can go in 
both directions at the same time. Point-to-point means that each connection has exactly two 
end points. TCP does not support multicasting or broadcasting. 


A TCP connection is a byte stream, not a message stream. Message boundaries are not 
preserved end to end. For example, if the sending process does four 512-byte writes to a TCP 
stream, these data may be delivered to the receiving process as four 512-byte chunks, two 
1024-byte chunks, one 2048-byte chunk (see Fig. 6-28), or some other way. There is no way 
for the receiver to detect the unit(s) in which the data were written. 


Figure 6-28. (a) Four 512-byte segments sent as separate IP 
datagrams. (b) The 2048 bytes of data delivered to the application in a 
single READ call. 

Files in UNIX have this property too. The reader of a file cannot tell whether the file was 
written a block at a time, a byte at a time, or all in one blow. As with a UNIX file, the TCP 
software has no idea of what the bytes mean and no interest in finding out. A byte is just a 
byte. 
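
Because TCP does not preserve message boundaries, an application that needs them has to add its own framing on top of the byte stream. The following is a minimal sketch of one common approach, a 4-byte length prefix in front of each message. The helper names (send_msg, recv_exact, recv_msg) and the prefix format are assumptions made for illustration, and socket.socketpair() stands in for a real TCP connection so that the example runs in a single process.

```python
import socket
import struct

def send_msg(sock, payload):
    """Send one message preceded by its length as a 4-byte big-endian integer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, however the stream happens to be chunked."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("stream closed before the message was complete")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_msg(sock):
    """Recover one message: read the length prefix, then that many bytes."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Two connected stream sockets in one process (a stand-in for a TCP connection).
a, b = socket.socketpair()
for i in range(4):
    send_msg(a, bytes([i]) * 512)        # four 512-byte writes
received = [recv_msg(b) for _ in range(4)]
a.close()
b.close()
print([len(m) for m in received])
```

The receiver recovers the four messages regardless of how the underlying stream delivers the bytes, because the boundaries now travel inside the data itself.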




When an application passes data to TCP, TCP may send it immediately or buffer it (in order to 
collect a larger amount to send at once), at its discretion. However, sometimes, the application 
really wants the data to be sent immediately. For example, suppose a user is logged in to a 
remote machine. After a command line has been finished and the carriage return typed, it is 
essential that the line be shipped off to the remote machine immediately and not buffered until 
the next line comes in. To force data out, applications can use the PUSH flag, which tells TCP 
not to delay the transmission. 


Some early applications used the PUSH flag as a kind of marker to delineate message 
boundaries. While this trick sometimes works, it sometimes fails since not all implementations 
of TCP pass the PUSH flag to the application on the receiving side. Furthermore, if additional 
PUSHes come in before the first one has been transmitted (e.g., because the output line is 
busy), TCP is free to collect all the PUSHed data into a single IP datagram, with no separation 
between the various pieces. 


One last feature of the TCP service that is worth mentioning here is urgent data. When an 
interactive user hits the DEL or CTRL-C key to break off a remote computation that has already 
begun, the sending application puts some control information in the data stream and gives it to 
TCP along with the URGENT flag. This event causes TCP to stop accumulating data and 
transmit everything it has for that connection immediately. 


When the urgent data are received at the destination, the receiving application is interrupted 
(e.g., given a signal in UNIX terms) so it can stop whatever it was doing and read the data 
stream to find the urgent data. The end of the urgent data is marked so the application knows 
when it is over. The start of the urgent data is not marked. It is up to the application to figure 
that out. This scheme basically provides a crude signaling mechanism and leaves everything 
else up to the application. 


6.3.3 The TCP Protocol 


In this section we will give a general overview of the TCP protocol. In the next one we will go 
over the protocol header, field by field. 


A key feature of TCP, and one which dominates the protocol design, is that every byte on a 
TCP connection has its own 32-bit sequence number. When the Internet began, the lines 
between routers were mostly 56-kbps leased lines, so a host blasting away at full speed took 
over 1 week to cycle through the sequence numbers. At modern network speeds, the sequence 
numbers can be consumed at an alarming rate, as we will see later. Separate 32-bit sequence 
numbers are used for acknowledgements and for the window mechanism, as discussed below. 
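
The wraparound time quoted above can be checked with a few lines of arithmetic. This is only a sketch: the 56-kbps figure comes from the text, while the 1-Gbps rate is an assumed example of a "modern" link, and header overhead is ignored.

```python
SEQ_SPACE_BYTES = 2 ** 32          # one sequence number per byte

def wrap_time_seconds(bits_per_second):
    """Time for a sender running at full speed to use every sequence number once."""
    return SEQ_SPACE_BYTES * 8 / bits_per_second

leased_line_days = wrap_time_seconds(56_000) / 86_400     # 56-kbps leased line
gigabit_seconds = wrap_time_seconds(1_000_000_000)        # assumed 1-Gbps link

print(f"56 kbps: {leased_line_days:.1f} days")
print(f"1 Gbps:  {gigabit_seconds:.1f} seconds")
```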


The sending and receiving TCP entities exchange data in the form of segments. A TCP 
segment consists of a fixed 20-byte header (plus an optional part) followed by zero or more 
data bytes. The TCP software decides how big segments should be. It can accumulate data 
from several writes into one segment or can split data from one write over multiple segments. 
Two limits restrict the segment size. First, each segment, including the TCP header, must fit in 
the 65,515-byte IP payload. Second, each network has a maximum transfer unit, or MTU, 
and each segment must fit in the MTU. In practice, the MTU is generally 1500 bytes (the 
Ethernet payload size) and thus defines the upper bound on segment size. 


The basic protocol used by TCP entities is the sliding window protocol. When a sender 
transmits a segment, it also starts a timer. When the segment arrives at the destination, the 
receiving TCP entity sends back a segment (with data if any exist, otherwise without data) 
bearing an acknowledgement number equal to the next sequence number it expects to receive. 
If the sender's timer goes off before the acknowledgement is received, the sender transmits 
the segment again. 


Although this protocol sounds simple, there are a number of sometimes subtle ins and outs, 
which we will cover below. Segments can arrive out of order, so bytes 3072-4095 can arrive 
but cannot be acknowledged because bytes 2048-3071 have not turned up yet. Segments 
can also be delayed so long in transit that the sender times out and retransmits them. The 
retransmissions may include different byte ranges than the original transmission, requiring a 
careful administration to keep track of which bytes have been correctly received so far. 
However, since each byte in the stream has its own unique offset, it can be done. 


TCP must be prepared to deal with these problems and solve them in an efficient way. A 
considerable amount of effort has gone into optimizing the performance of TCP streams, even 
in the face of network problems. A number of the algorithms used by many TCP 
implementations will be discussed below. 


6.3.4 The TCP Segment Header 


Figure 6-29 shows the layout of a TCP segment. Every segment begins with a fixed-format, 
20-byte header. The fixed header may be followed by header options. After the options, if any, 
up to 65,535 - 20 - 20 = 65,495 data bytes may follow, where the first 20 refer to the IP 
header and the second to the TCP header. Segments without any data are legal and are 
commonly used for acknowledgements and control messages. 


Figure 6-29. The TCP header. 

Let us dissect the TCP header field by field. The Source port and Destination port fields identify 
the local end points of the connection. The well-known ports are defined at www.iana.org but 
each host can allocate the others as it wishes. A port plus its host's IP address forms a 48-bit 
unique end point. The source and destination end points together identify the connection. 


The Sequence number and Acknowledgement number fields perform their usual functions. 
Note that the latter specifies the next byte expected, not the last byte correctly received. Both 
are 32 bits long because every byte of data is numbered in a TCP stream. 


The TCP header length tells how many 32-bit words are contained in the TCP header. This 
information is needed because the Options field is of variable length, so the header is, too. 
Technically, this field really indicates the start of the data within the segment, measured in 
32-bit words, but that number is just the header length in words, so the effect is the same. 


Next comes a 6-bit field that is not used. The fact that this field has survived intact for over a 
quarter of a century is testimony to how well thought out TCP is. Lesser protocols would have 
needed it to fix bugs in the original design. 


Now come six 1-bit flags. URG is set to 1 if the Urgent pointer is in use. The Urgent pointer is 
used to indicate a byte offset from the current sequence number at which urgent data are to 
be found. This facility is in lieu of interrupt messages. As we mentioned above, this facility is a 
bare-bones way of allowing the sender to signal the receiver without getting TCP itself involved 
in the reason for the interrupt. 


The ACK bit is set to 1 to indicate that the Acknowledgement number is valid. If ACK is 0, the 
segment does not contain an acknowledgement so the Acknowledgement number field is 
ignored. 


The PSH bit indicates PUSHed data. The receiver is hereby kindly requested to deliver the data 
to the application upon arrival and not buffer it until a full buffer has been received (which it 
might otherwise do for efficiency). 


The RST bit is used to reset a connection that has become confused due to a host crash or 
some other reason. It is also used to reject an invalid segment or refuse an attempt to open a 
connection. In general, if you get a segment with the RST bit on, you have a problem on your 
hands. 


The SYN bit is used to establish connections. The connection request has SYN = 1 and ACK = 0 
to indicate that the piggyback acknowledgement field is not in use. The connection reply does 
bear an acknowledgement, so it has SYN = 1 and ACK = 1. In essence the SYN bit is used to 
denote CONNECTION REQUEST and CONNECTION ACCEPTED, with the ACK bit used to 
distinguish between those two possibilities. 


The FIN bit is used to release a connection. It specifies that the sender has no more data to 
transmit. However, after closing a connection, the closing process may continue to receive 
data indefinitely. Both SYN and FIN segments have sequence numbers and are thus 
guaranteed to be processed in the correct order. 
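
As an illustration of the fields described so far, the fixed 20-byte header can be unpacked with Python's struct module. This is a sketch, not a full implementation: the field order follows the standard TCP header layout (the figure itself is not reproduced here), the three remaining 16-bit fields (Window size, Checksum, and Urgent pointer) are unpacked as well, and the function and dictionary key names are chosen for this example only.

```python
import struct

# Bit positions of the six 1-bit flags within the flag byte.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_tcp_header(segment):
    """Decode the fixed 20-byte part of a TCP header (options are not decoded)."""
    (src_port, dst_port, seq, ack, offset_byte, flag_byte,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_byte >> 4) * 4,   # field counts 32-bit words
        "flags": {name for name, bit in FLAG_BITS.items() if flag_byte & bit},
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
    }

# A hand-built connection request: SYN = 1, ACK = 0, no options, no data.
syn = struct.pack("!HHIIBBHHH", 1234, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
parsed = parse_tcp_header(syn)
print(parsed["dst_port"], parsed["header_len"], sorted(parsed["flags"]))
```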


Flow control in TCP is handled using a variable-sized sliding window. The Window size field tells 
how many bytes may be sent starting at the byte acknowledged. A Window size field of 0 is 
legal and says that the bytes up to and including Acknowledgement number - 1 have been 
received, but that the receiver is currently badly in need of a rest and would like no more data 
for the moment, thank you. The receiver can later grant permission to send by transmitting a 
segment with the same Acknowledgement number and a nonzero Window size field. 


In the protocols of Chap. 3, acknowledgements of frames received and permission to send new 
frames were tied together. This was a consequence of a fixed window size for each protocol. In 
TCP, acknowledgements and permission to send additional data are completely decoupled. In 
effect, a receiver can say: I have received bytes up through k but I do not want any more just 
now. This decoupling (in fact, a variable-sized window) gives additional flexibility. We will 
study it in detail below. 


A Checksum is also provided for extra reliability. It checksums the header, the data, and the 
conceptual pseudoheader shown in Fig. 6-30. When performing this computation, the TCP 
Checksum field is set to zero and the data field is padded out with an additional zero byte if its 
length is an odd number. The checksum algorithm is simply to add up all the 16-bit words in 
one's complement and then to take the one's complement of the sum. As a consequence, 
when the receiver performs the calculation on the entire segment, including the Checksum 
field, the result should be 0. 


Figure 6-30. The pseudoheader included in the TCP checksum. 

The pseudoheader contains the 32-bit IP addresses of the source and destination machines, 
the protocol number for TCP (6), and the byte count for the TCP segment (including the 
header). Including the pseudoheader in the TCP checksum computation helps detect 
misdelivered packets, but including it also violates the protocol hierarchy since the IP 
addresses in it belong to the IP layer, not to the TCP layer. UDP uses the same pseudoheader 
for its checksum. 
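
The checksum computation described above can be sketched as follows. The pseudoheader is assumed to be laid out as source address, destination address, a zero byte, the protocol number, and the 16-bit segment length; the IP addresses and the sample segment are made up for the example.

```python
import socket
import struct

def ones_complement_sum(data):
    """Add up all 16-bit words in one's complement arithmetic."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length data with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                           # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return total

def tcp_checksum(src_ip, dst_ip, segment):
    """One's complement of the sum over the pseudoheader, TCP header, and data."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(segment)))   # 6 = protocol number for TCP
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF

# Sender: compute the checksum with the Checksum field (bytes 16-17) set to zero.
header = struct.pack("!HHIIBBHHH", 1234, 80, 1000, 2000, 5 << 4, 0x18, 4096, 0, 0)
segment = header + b"hello"                      # odd-length data exercises the padding
csum = tcp_checksum("10.0.0.1", "10.0.0.2", segment)
filled = segment[:16] + struct.pack("!H", csum) + segment[18:]

# Receiver: the same computation over the entire segment, Checksum field included.
check = tcp_checksum("10.0.0.1", "10.0.0.2", filled)
print(hex(csum), check)
```

If a single bit of the filled-in segment or of either address is changed, the receiver's result is no longer 0, which is how misdelivered or corrupted segments are caught.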


The Options field provides a way to add extra facilities not covered by the regular header. The 
most important option is the one that allows each host to specify the maximum TCP payload it 
is willing to accept. Using large segments is more efficient than using small ones because the 
20-byte header can then be amortized over more data, but small hosts may not be able to 
handle big segments. During connection setup, each side can announce its maximum and see 
its partner's. If a host does not use this option, it defaults to a 536-byte payload. All Internet 
hosts are required to accept TCP segments of 536 + 20 = 556 bytes. The maximum segment 
size in the two directions need not be the same. 


For lines with high bandwidth, high delay, or both, the 64-KB window is often a problem. On a 
T3 line (44.736 Mbps), it takes only 12 msec to output a full 64-KB window. If the round-trip 
propagation delay is 50 msec (which is typical for a transcontinental fiber), the sender will be 
idle 3/4 of the time waiting for acknowledgements. On a satellite connection, the situation is 
even worse. A larger window size would allow the sender to keep pumping data out, but using 
the 16-bit Window size field, there is no way to express such a size. In RFC 1323, a Window 
scale option was proposed, allowing the sender and receiver to negotiate a window scale 
factor. This number allows both sides to shift the Window size field up to 14 bits to the left, 
thus allowing windows of up to 2^30 bytes. Most TCP implementations now support this option. 
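
The figures in this paragraph can be reproduced with a short calculation. It is a sketch under a simple model: the sender transmits one full window and then waits until the round-trip time has elapsed since it started, and the variable names are chosen for this example only.

```python
LINK_BPS = 44.736e6            # T3 line
MAX_WINDOW = 2 ** 16 - 1       # largest value of the 16-bit Window size field, in bytes
RTT = 0.050                    # 50-msec round-trip propagation delay

send_time = MAX_WINDOW * 8 / LINK_BPS     # seconds needed to output a full window
idle_fraction = 1 - send_time / RTT       # share of each round trip spent waiting
bdp_bytes = LINK_BPS * RTT / 8            # window needed to keep the line busy
max_scaled = MAX_WINDOW << 14             # largest window with a 14-bit shift

print(f"send time     : {send_time * 1000:.1f} msec")
print(f"idle fraction : {idle_fraction:.2f}")
print(f"window needed : {bdp_bytes:,.0f} bytes")
print(f"scaled maximum: {max_scaled:,} bytes")
```

The window needed to fill the line is more than four times what an unscaled 16-bit field can express, which is the motivation for the Window scale option.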


Another option proposed by RFC 1106 and now widely implemented is the use of the selective 
repeat instead of go back n protocol. If the receiver gets one bad segment and then a large 
number of good ones, the normal TCP protocol will eventually time out and retransmit all the 
unacknowledged segments, including all those that were received correctly (i.e., the go back n 
protocol). RFC 1106 introduced NAKs to allow the receiver to ask for a specific segment (or 
segments). After it gets these, it can acknowledge all the buffered data, thus reducing the 
amount of data retransmitted. 


6.3.5 TCP Connection Establishment 


Connections are established in TCP by means of the three-way handshake discussed in Sec. 
6.2.2. To establish a connection, one side, say, the server, passively waits for an incoming 
connection by executing the LISTEN and ACCEPT primitives, either specifying a specific source 
or nobody in particular. 


The other side, say, the client, executes a CONNECT primitive, specifying the IP address and 
port to which it wants to connect, the maximum TCP segment size it is willing to accept, and 
optionally some user data (e.g., a password). The CONNECT primitive sends a TCP segment 
with the SYN bit on and ACK bit off and waits for a response. 


When this segment arrives at the destination, the TCP entity there checks to see if there is a 
process that has done a LISTEN on the port given in the Destination port field. If not, it sends 
a reply with the RST bit on to reject the connection. 


If some process is listening to the port, that process is given the incoming TCP segment. It can 
then either accept or reject the connection. If it accepts, an acknowledgement segment is sent 
back. The sequence of TCP segments sent in the normal case is shown in Fig. 6-31(a). Note 
that a SYN segment consumes 1 byte of sequence space so that it can be acknowledged 
unambiguously. 


Figure 6-31. (a) TCP connection establishment in the normal case. (b) 
Call collision. 

In the event that two hosts simultaneously attempt to establish a connection between the 
same two sockets, the sequence of events is as illustrated in Fig. 6-31(b). The result of these 
events is that just one connection is established, not two because connections are identified by 
their end points. If the first setup results in a connection identified by (x, y) and the second 
one does too, only one table entry is made, namely, for (x, y). 


The initial sequence number on a connection is not 0 for the reasons we discussed earlier. A 
clock-based scheme is used, with a clock tick every 4 μsec. For additional safety, when a host 
crashes, it may not reboot for the maximum packet lifetime to make sure that no packets from 
previous connections are still roaming around the Internet somewhere. 
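
A clock-driven initial sequence number of the kind just described might be sketched as follows. The use of time.time() as the clock and the function name are assumptions made for illustration; only the 4-microsecond tick and the 32-bit sequence space come from the text.

```python
import time

TICK_MICROSECONDS = 4          # one clock tick every 4 microseconds
SEQ_SPACE = 2 ** 32            # sequence numbers are 32 bits wide

def initial_sequence_number(now=None):
    """Derive an initial sequence number from a clock reading given in seconds."""
    if now is None:
        now = time.time()      # assumption: wall-clock time stands in for the TCP clock
    ticks = int(now * 1_000_000) // TICK_MICROSECONDS
    return ticks % SEQ_SPACE

# Time the clock takes to run through the whole sequence space once.
wrap_hours = SEQ_SPACE * TICK_MICROSECONDS / 1_000_000 / 3600

print(initial_sequence_number(), f"{wrap_hours:.2f} hours")
```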


6.3.6 TCP Connection Release 


Although TCP connections are full duplex, to understand how connections are released it is 
best to think of them as a pair of simplex connections. Each simplex connection is released 
independently of its sibling. To release a connection, either party can send a TCP segment with 
the FIN bit set, which means that it has no more data to transmit. When the FIN is 
acknowledged, that direction is shut down for new data. Data may continue to flow indefinitely 
in the other direction, however. When both directions have been shut down, the connection is 
released. Normally, four TCP segments are needed to release a connection, one FIN and one 
ACK for each direction. However, it is possible for the first ACK and the second FIN to be 
contained in the same segment, reducing the total count to three. 


Just as with telephone calls in which both people say goodbye and hang up the phone 
simultaneously, both ends of a TCP connection may send FIN segments at the same time. 
These are each acknowledged in the usual way, and the connection is shut down. There is, in 
fact, no essential difference between the two hosts releasing sequentially or simultaneously. 


To avoid the two-army problem, timers are used. If a response to a FIN is not forthcoming 
within two maximum packet lifetimes, the sender of the FIN releases the connection. The other 
side will eventually notice that nobody seems to be listening to it any more and will time out as 
well. While this solution is not perfect, given the fact that a perfect solution is theoretically 
impossible, it will have to do. In practice, problems rarely arise. 


6.3.7 TCP Connection Management Modeling 


The steps required to establish and release connections can be represented in a finite state 
machine with the 11 states listed in Fig. 6-32. In each state, certain events are legal. When a 
legal event happens, some action may be taken. If some other event happens, an error is 
reported. 


Figure 6-32. The states used in the TCP connection management finite 
state machine. 


State | Description 
CLOSED | No connection is active or pending 
LISTEN | The server is waiting for an incoming call 
SYN RCVD | A connection request has arrived; wait for ACK 
SYN SENT | The application has started to open a connection 
ESTABLISHED | The normal data transfer state 
FIN WAIT 1 | The application has said it is finished 
FIN WAIT 2 | The other side has agreed to release 
TIMED WAIT | Wait for all packets to die off 
CLOSING | Both sides have tried to close simultaneously 
CLOSE WAIT | The other side has initiated a release 
LAST ACK | Wait for all packets to die off 


Each connection starts in the CLOSED state. It leaves that state when it does either a passive 
open (LISTEN), or an active open (CONNECT). If the other side does the opposite one, a 
connection is established and the state becomes ESTABLISHED. Connection release can be 
initiated by either side. When it is complete, the state returns to CLOSED. 


The finite state machine itself is shown in Fig. 6-33. The common case of a client actively 
connecting to a passive server is shown with heavy lines—solid for the client, dotted for the 
server. The lightface lines are unusual event sequences. Each line in Fig. 6-33 is marked by an 
event/action pair. The event can either be a user-initiated system call (CONNECT, LISTEN, 
SEND, or CLOSE), a segment arrival (SYN, FIN, ACK, or RST), or, in one case, a timeout of 
twice the maximum packet lifetime. The action is the sending of a control segment (SYN, FIN, 
or RST) or nothing, indicated by —. Comments are shown in parentheses. 


Figure 6-33. TCP connection management finite state machine. The 
heavy solid line is the normal path for a client. The heavy dashed line 
is the normal path for a server. The light lines are unusual events. 
Each transition is labeled by the event causing it and the action 
resulting from it, separated by a slash. 

One can best understand the diagram by first following the path of a client (the heavy solid 
line), then later following the path of a server (the heavy dashed line). When an application 
program on the client machine issues a CONNECT request, the local TCP entity creates a 
connection record, marks it as being in the SYN SENT state, and sends a SYN segment. Note 
that many connections may be open (or being opened) at the same time on behalf of multiple 
applications, so the state is per connection and recorded in the connection record. When the 
SYN+ACK arrives, TCP sends the final ACK of the three-way handshake and switches into the 
ESTABLISHED state. Data can now be sent and received. 


When an application is finished, it executes a CLOSE primitive, which causes the local TCP 
entity to send a FIN segment and wait for the corresponding ACK (dashed box marked active 
close). When the ACK arrives, a transition is made to state FIN WAIT 2 and one direction of the 
connection is now closed. When the other side closes, too, a FIN comes in, which is 
acknowledged. Now both sides are closed, but TCP waits a time equal to twice the maximum packet 
lifetime to guarantee that all packets from the connection have died off, just in case the 
acknowledgement was lost. When the timer goes off, TCP deletes the connection record. 


Now let us examine connection management from the server's viewpoint. The server does a 
LISTEN and settles down to see who turns up. When a SYN comes in, it is acknowledged and 
the server goes to the SYN RCVD state. When the server's SYN is itself acknowledged, the 
three-way handshake is complete and the server goes to the ESTABLISHED state. Data 
transfer can now occur. 


When the client is done, it does a CLOSE, which causes a FIN to arrive at the server (dashed 
box marked passive close). The server is then signaled. When it, too, does a CLOSE, a FIN is 
sent to the client. When the client's acknowledgement shows up, the server releases the 
connection and deletes the connection record. 
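
The two normal paths just walked through can be written down as a small transition table. This is a partial sketch: it covers only the client and server sequences described in the text, not the unusual transitions (such as CLOSING), and the event labels and the follow function are names chosen for this example.

```python
# (current state, event) -> (action, next state); "-" means no segment is sent.
TRANSITIONS = {
    # Normal path of a client (active open, then active close)
    ("CLOSED", "CONNECT"): ("SYN", "SYN SENT"),
    ("SYN SENT", "SYN+ACK"): ("ACK", "ESTABLISHED"),
    ("ESTABLISHED", "CLOSE"): ("FIN", "FIN WAIT 1"),
    ("FIN WAIT 1", "ACK"): ("-", "FIN WAIT 2"),
    ("FIN WAIT 2", "FIN"): ("ACK", "TIMED WAIT"),
    ("TIMED WAIT", "TIMEOUT"): ("-", "CLOSED"),
    # Normal path of a server (passive open, then passive close)
    ("CLOSED", "LISTEN"): ("-", "LISTEN"),
    ("LISTEN", "SYN"): ("SYN+ACK", "SYN RCVD"),
    ("SYN RCVD", "ACK"): ("-", "ESTABLISHED"),
    ("ESTABLISHED", "FIN"): ("ACK", "CLOSE WAIT"),
    ("CLOSE WAIT", "CLOSE"): ("FIN", "LAST ACK"),
    ("LAST ACK", "ACK"): ("-", "CLOSED"),
}

def follow(events, state="CLOSED"):
    """Return the list of states visited; an event that is not legal raises an error."""
    visited = [state]
    for event in events:
        if (state, event) not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not legal in state {state!r}")
        _action, state = TRANSITIONS[(state, event)]
        visited.append(state)
    return visited

client = follow(["CONNECT", "SYN+ACK", "CLOSE", "ACK", "FIN", "TIMEOUT"])
server = follow(["LISTEN", "SYN", "ACK", "FIN", "CLOSE", "ACK"])
print(" -> ".join(client))
print(" -> ".join(server))
```

Keeping the state in a per-connection record, as the text notes, amounts to storing one such state value for each open connection.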


6.3.8 TCP Transmission Policy 


As mentioned earlier, window management in TCP is not directly tied to acknowledgements as 
it is in most data link protocols. For example, suppose the receiver has a 4096-byte buffer, as 
shown in Fig. 6-34. If the sender transmits a 2048-byte segment that is correctly received, the 
receiver will acknowledge the segment. However, since it now has only 2048 bytes of buffer 
space (until the application removes some data from the buffer), it will advertise a window of 
2048 starting at the next byte expected. 


Figure 6-34. Window management in TCP. 

Now the sender transmits another 2048 bytes, which are acknowledged, but the advertised 
window is 0. The sender must stop until the application process on the receiving host has 
removed some data from the buffer, at which time TCP can advertise a larger window. 
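
The exchange described above can be mimicked with a toy receiver that tracks its buffer occupancy. This is a sketch of the bookkeeping only; the class and method names are made up for the example, and each method returns the pair (acknowledgement number, advertised window).

```python
class Receiver:
    """Toy receiver: the advertised window is whatever buffer space is still free."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffered = 0              # bytes waiting for the application
        self.next_expected = 0         # next byte expected (the acknowledgement number)

    def window(self):
        return self.buffer_size - self.buffered

    def segment_arrives(self, seq, length):
        # Accept only in-order data that fits in the free buffer space.
        if seq == self.next_expected and length <= self.window():
            self.buffered += length
            self.next_expected += length
        return self.next_expected, self.window()

    def application_reads(self, length):
        self.buffered -= min(length, self.buffered)
        return self.next_expected, self.window()

rx = Receiver(4096)
first = rx.segment_arrives(0, 2048)        # buffer half full
second = rx.segment_arrives(2048, 2048)    # buffer full, window closes
update = rx.application_reads(2048)        # application frees space, window reopens
print(first, second, update)
```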


When the window is 0, the sender may not normally send segments, with two exceptions. 
First, urgent data may be sent, for example, to allow the user to kill the process running on 
the remote machine. Second, the sender may send a 1-byte segment to make the receiver 
reannounce the next byte expected and window size. The TCP standard explicitly provides this 
option to prevent deadlock if a window announcement ever gets lost. 


Senders are not required to transmit data as soon as they come in from the application. 
Neither are receivers required to send acknowledgements as soon as possible. For example, in 
Fig. 6-34, when the first 2 KB of data came in, TCP, knowing that it had a 4-KB window 
available, would have been completely correct in just buffering the data until another 2 KB 
came in, to be able to transmit a segment with a 4-KB payload. This freedom can be exploited 
to improve performance. 


Consider a telnet connection to an interactive editor that reacts on every keystroke. In the 
worst case, when a character arrives at the sending TCP entity, TCP creates a 21-byte TCP 
segment, which it gives to IP to send as a 41-byte IP datagram. At the receiving side, TCP 
immediately sends a 40-byte acknowledgement (20 bytes of TCP header and 20 bytes of IP 
header). Later, when the editor has read the byte, TCP sends a window update, moving the 
window 1 byte to the right. This packet is also 40 bytes. Finally, when the editor has processed 
the character, it echoes the character as a 41-byte packet. In all, 162 bytes of bandwidth are 
used and four segments are sent for each character typed. When bandwidth is scarce, this 
method of doing business is not desirable. 


One approach that many TCP implementations use to optimize this situation is to delay 
acknowledgements and window updates for 500 msec in the hope of acquiring some data on 
which to hitch a free ride. Assuming the editor echoes within 500 msec, only one 41-byte 
packet now need be sent back to the remote user, cutting the packet count and bandwidth 
usage in half. 
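
The byte counts in the last two paragraphs follow directly from the header sizes, as this short calculation shows. It assumes 20-byte TCP and IP headers with no options, as in the text.

```python
TCP_HEADER = 20
IP_HEADER = 20

def packet_size(data_bytes):
    """Size of an IP datagram carrying one TCP segment with the given amount of data."""
    return IP_HEADER + TCP_HEADER + data_bytes

# Without delay: character, acknowledgement, window update, echo.
immediate = [packet_size(1), packet_size(0), packet_size(0), packet_size(1)]

# With delayed acknowledgements: the acknowledgement and window update ride on the echo.
delayed = [packet_size(1), packet_size(1)]

print(len(immediate), sum(immediate))
print(len(delayed), sum(delayed))
```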


Although this rule reduces the load placed on the network by the receiver, the sender is still 
operating inefficiently by sending 41-byte packets containing 1 byte of data. A way to reduce 
this usage is known as Nagle's algorithm (Nagle, 1984). What Nagle suggested is simple: 
when data come into the sender one byte at a time, just send the first byte and buffer all the 
rest until the outstanding byte is acknowledged. Then send all the buffered characters in one 
TCP segment and start buffering again until they are all acknowledged. If the user is typing 
quickly and the network is slow, a substantial number of characters may go in each segment, 
greatly reducing the bandwidth used. The algorithm additionally allows a new packet to be 
sent if enough data have trickled in to fill half the window or a maximum segment. 
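
Nagle's rule can be sketched as a small sender model. This is an illustration rather than a real TCP implementation: the class name, the default segment size and window, and the simplification of a single outstanding acknowledgement are all assumptions.

```python
class NagleSender:
    """Toy sender: hold back small writes while earlier data is unacknowledged."""

    def __init__(self, mss=536, window=4096):
        self.mss = mss
        self.window = window
        self.buffer = bytearray()
        self.unacked = False           # is there data in flight awaiting an ACK?
        self.sent = []                 # segments handed to IP, in order

    def write(self, data):
        self.buffer += data
        self._try_send()

    def ack_received(self):
        self.unacked = False
        self._try_send()

    def _try_send(self):
        while self.buffer:
            # Enough data to fill half the window or a maximum segment?
            big_enough = len(self.buffer) >= min(self.mss, self.window // 2)
            if self.unacked and not big_enough:
                return                 # keep buffering until the ACK arrives
            self.sent.append(bytes(self.buffer[:self.mss]))
            del self.buffer[:self.mss]
            self.unacked = True

sender = NagleSender()
for ch in b"hello":                    # five one-byte writes, typed before any ACK
    sender.write(bytes([ch]))
before_ack = list(sender.sent)
sender.ack_received()                  # the ACK for the first byte arrives
print(before_ack, sender.sent)
```

Only the first character goes out on its own; the other four travel together in the second segment once the acknowledgement arrives.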


Nagle's algorithm is widely used by TCP implementations, but there are times when it is better 
to disable it. In particular, when an X Windows application is being run over the Internet, 
mouse movements have to be sent to the remote computer. (The X Window system is the 
windowing system used on most UNIX systems.) Gathering them up to send in bursts makes 
the mouse cursor move erratically, which makes for unhappy users. 
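
In the Berkeley sockets interface, Nagle's algorithm is disabled for an individual connection with the TCP_NODELAY option. A minimal sketch using Python's socket module:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A nonzero value turns Nagle's algorithm off for this socket only.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
print("Nagle disabled:", bool(nodelay))
```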


Another problem that can degrade TCP performance is the silly window syndrome (Clark, 
1982). This problem occurs when data are passed to the sending TCP entity in large blocks, 
but an interactive application on the receiving side reads data 1 byte at a time. To see the 
problem, look at Fig. 6-35. Initially, the TCP buffer on the receiving side is full and the sender 
knows this (i.e., has a window of size 0). Then the interactive application reads one character 
from the TCP stream. This action makes the receiving TCP happy, so it sends a window update 
to the sender saying that it is all right to send 1 byte. The sender obliges and sends 1 byte. 
The buffer is now full, so the receiver acknowledges the 1-byte segment but sets the window 
to 0. This behavior can go on forever. 


Figure 6-35. Silly window syndrome. 

Clark's solution is to prevent the receiver from sending a window update for 1 byte. Instead it 
is forced to wait until it has a decent amount of space available and advertise that instead. 
Specifically, the receiver should not send a window update until it can handle the maximum 
segment size it advertised when the connection was established or until its buffer is half 
empty, whichever is smaller. 
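
Clark's rule for the receiver reduces to a one-line test. The function and parameter names below are chosen for this sketch, and the buffer and segment sizes in the example are arbitrary.

```python
def should_send_window_update(free_space, buffer_size, mss):
    """Advertise new space only once it reaches the smaller of one maximum
    segment and half of the receive buffer."""
    return free_space >= min(mss, buffer_size // 2)

# With a 4096-byte buffer and a 1024-byte maximum segment:
tiny = should_send_window_update(1, 4096, 1024)        # one byte freed: stay silent
enough = should_send_window_update(1024, 4096, 1024)   # a full segment fits: advertise

# With a buffer smaller than a segment, half the buffer is the threshold.
small_buffer = should_send_window_update(500, 1000, 1024)
print(tiny, enough, small_buffer)
```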


Furthermore, the sender can also help by not sending tiny segments. Instead, it should try to 
wait until it has accumulated enough space in the window to send a full segment or at least 
one containing half of the receiver's buffer size (which it must estimate from the pattern of 
window updates it has received in the past). 


Nagle's algorithm and Clark's solution to the silly window syndrome are complementary. Nagle 
was trying to solve the problem caused by the sending application delivering data to TCP a 
byte at a time. Clark was trying to solve the problem of the receiving application sucking the 
data up from TCP a byte at a time. Both solutions are valid and can work together. The goal is 
for the sender not to send small segments and the receiver not to ask for them. 


The receiving TCP can go further in improving performance than just doing window updates in 
large units. Like the sending TCP, it can also buffer data, so it can block a READ request from 
the application until it has a large chunk of data to provide. Doing this reduces the number of 
calls to TCP, and hence the overhead. Of course, it also increases the response time, but for 
noninteractive applications like file transfer, efficiency may be more important than response 
time to individual requests. 


Another receiver issue is what to do with out-of-order segments. They can be kept or 
discarded, at the receiver's discretion. Of course, acknowledgements can be sent only when all 
the data up to the byte acknowledged have been received. If the receiver gets segments 0, 1, 
2, 4, 5, 6, and 7, it can acknowledge everything up to and including the last byte in segment 
2. When the sender times out, it then retransmits segment 3. If the receiver has buffered 
segments 4 through 7, upon receipt of segment 3 it can acknowledge all bytes up to the end of 
segment 7. 
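
The example above can be traced with a toy receiver that keeps out-of-order segments. This is a sketch: it assumes equal-sized segments (1024 bytes here, an arbitrary choice) numbered from 0, and the class name is made up for the example.

```python
class Reassembler:
    """Toy receiver that buffers out-of-order segments and acknowledges cumulatively."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.next_expected = 0         # number of the next in-order segment
        self.held = set()              # segments received ahead of the gap

    def segment_arrives(self, number):
        if number >= self.next_expected:
            self.held.add(number)
        # Advance past every segment that is now contiguous.
        while self.next_expected in self.held:
            self.held.remove(self.next_expected)
            self.next_expected += 1
        # The acknowledgement number is the next byte expected.
        return self.next_expected * self.segment_size

rx = Reassembler(segment_size=1024)
acks = [rx.segment_arrives(n) for n in (0, 1, 2, 4, 5, 6, 7)]
final_ack = rx.segment_arrives(3)      # the retransmitted segment fills the gap
print(acks, final_ack)
```

The acknowledgement number stays at the end of segment 2 while segments 4 through 7 arrive, then jumps to the end of segment 7 as soon as segment 3 is received.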


6.3.9 TCP Congestion Control 


When the load offered to any network is more than it can handle, congestion builds up. The 
Internet is no exception. In this section we will discuss algorithms that have been developed 
over the past quarter of a century to deal with congestion. Although the network layer also 
tries to manage congestion, most of the heavy lifting is done by TCP because the real solution 
to congestion is to slow down the data rate. 


In theory, congestion can be dealt with by employing a principle borrowed from physics: the 
law of conservation of packets. The idea is to refrain from injecting a new packet into the 
network until an old one leaves (i.e., is delivered). TCP attempts to achieve this goal by 
dynamically manipulating the window size. 


The first step in managing congestion is detecting it. In the old days, detecting congestion was 
difficult. A timeout caused by a lost packet could have been caused by either (1) noise on a 
transmission line or (2) packet discard at a congested router. Telling the difference was 
difficult. 


Nowadays, packet loss due to transmission errors is relatively rare because most long-haul 
trunks are fiber (although wireless networks are a different story). Consequently, most 
transmission timeouts on the Internet are due to congestion. All the Internet TCP algorithms 
assume that timeouts are caused by congestion and monitor timeouts for signs of trouble the 
way miners watch their canaries. 


Before discussing how TCP reacts to congestion, let us first describe what it does to try to 
prevent congestion from occurring in the first place. When a connection is established, a 
suitable window size has to be chosen. The receiver can specify a window based on its buffer 
size. If the sender sticks to this window size, problems will not occur due to buffer overflow at 
the receiving end, but they may still occur due to internal congestion within the network. 


In Fig. 6-36, we see this problem illustrated hydraulically. In Fig. 6-36(a), we see a thick pipe 
leading to a small-capacity receiver. As long as the sender does not send more water than the 
bucket can contain, no water will be lost. In Fig. 6-36(b), the limiting factor is not the bucket 
capacity, but the internal carrying capacity of the network. If too much water comes in too 
fast, it will back up and some will be lost (in this case by overflowing the funnel). 


Figure 6-36. (a) A fast network feeding a low-capacity receiver. (b) A 
slow network feeding a high-capacity receiver. 


The Internet solution is to realize that two potential problems exist (network capacity and 
receiver capacity) and to deal with each of them separately. To do so, each sender maintains 
two windows: the window the receiver has granted and a second window, the congestion 
window. Each reflects the number of bytes the sender may transmit. The number of bytes 
that may be sent is the minimum of the two windows. Thus, the effective window is the 
minimum of what the sender thinks is all right and what the receiver thinks is all right. If the 
receiver says "Send 8 KB" but the sender knows that bursts of more than 4 KB clog the 
network, it sends 4 KB. On the other hand, if the receiver says "Send 8 KB" and the sender 
knows that bursts of up to 32 KB get through effortlessly, it sends the full 8 KB requested. 


When a connection is established, the sender initializes the congestion window to the size of 
the maximum segment in use on the connection. It then sends one maximum segment. If this 
segment is acknowledged before the timer goes off, it adds one segment's worth of bytes to 
the congestion window to make it two maximum size segments and sends two segments. As 
each of these segments is acknowledged, the congestion window is increased by one 
maximum segment size. When the congestion window is n segments, if all n are acknowledged 
on time, the congestion window is increased by the byte count corresponding to n segments. 
In effect, each burst acknowledged doubles the congestion window. 


The congestion window keeps growing exponentially until either a timeout occurs or the 
receiver's window is reached. The idea is that if bursts of size, say, 1024, 2048, and 4096 
bytes work fine but a burst of 8192 bytes gives a timeout, the congestion window should be 
set to 4096 to avoid congestion. As long as the congestion window remains at 4096, no bursts 
longer than that will be sent, no matter how much window space the receiver grants. This 
algorithm is called slow start, but it is not slow at all (Jacobson, 1988). It is exponential. All 
TCP implementations are required to support it. 
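

To make the doubling concrete, the following Python sketch traces the bursts of an idealized slow start. It is an illustration only, not code from any TCP implementation: the function name slow_start_bursts and the parameter network_limit (the largest burst the network is assumed to carry without a timeout) are our own inventions, and we assume that every burst is either acknowledged in full or times out.

```python
def slow_start_bursts(mss, receiver_window, network_limit):
    """Return (burst_sizes, final_cwnd) for an idealized slow start.

    mss             -- maximum segment size in bytes
    receiver_window -- window granted by the receiver, in bytes
    network_limit   -- hypothetical largest burst (bytes) that gets
                       through without a timeout
    """
    cwnd = mss                                # start with one segment
    bursts = []
    while True:
        burst = min(cwnd, receiver_window)    # effective window
        bursts.append(burst)
        if burst > network_limit:             # this burst times out
            # Fall back to the last burst size that worked.
            cwnd = bursts[-2] if len(bursts) > 1 else mss
            return bursts, cwnd
        if cwnd >= receiver_window:           # receiver's window reached
            return bursts, cwnd
        cwnd *= 2                             # acknowledged burst doubles cwnd
```

With 1024-byte segments, a 64-KB receiver window, and a network assumed to choke on bursts above 4096 bytes, slow_start_bursts(1024, 65536, 4096) sends bursts of 1024, 2048, 4096, and 8192 bytes and settles on a congestion window of 4096 bytes, as in the example above.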


Now let us look at the Internet congestion control algorithm. It uses a third parameter, the 
threshold, initially 64 KB, in addition to the receiver and congestion windows. When a timeout 
occurs, the threshold is set to half of the current congestion window, and the congestion 
window is reset to one maximum segment. Slow start is then used to determine what the 
network can handle, except that exponential growth stops when the threshold is hit. From that 
point on, successful transmissions grow the congestion window linearly (by one maximum 
segment for each burst) instead of one per segment. In effect, this algorithm is guessing that 
it is probably acceptable to cut the congestion window in half, and then it gradually works its 
way up from there. 


As an illustration of how the congestion algorithm works, see Fig. 6-37. The maximum 
segment size here is 1024 bytes. Initially, the congestion window was 64 KB, but a timeout 
occurred, so the threshold is set to 32 KB and the congestion window to 1 KB for transmission 
0 here. The congestion window then grows exponentially until it hits the threshold (32 KB). 
Starting then, it grows linearly. 


Figure 6-37. An example of the Internet congestion algorithm. 


Transmission 13 is unlucky (it should have known) and a timeout occurs. The threshold is set 
to half the current window (by now 40 KB, so half is 20 KB), and slow start is initiated all over 
again. When the acknowledgements from transmission 14 start coming in, the first four each 
double the congestion window, but after that, growth becomes linear again. 
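

The numbers in this example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: window sizes are whole kilobytes, each transmission is one burst that is either fully acknowledged or times out, and the function name congestion_trace and its timeouts parameter are hypothetical, chosen only for illustration.

```python
def congestion_trace(mss_kb, threshold_kb, receiver_window_kb, timeouts, rounds):
    """Congestion window (in KB) used for each transmission number.

    timeouts -- set of transmission numbers whose burst is assumed to time out
    """
    cwnd = mss_kb
    trace = []
    for rnd in range(rounds):
        trace.append(cwnd)
        if rnd in timeouts:
            threshold_kb = cwnd // 2            # halve (integer KB for simplicity)
            cwnd = mss_kb                       # restart slow start
        elif cwnd < threshold_kb:
            cwnd = min(2 * cwnd, threshold_kb)  # exponential phase, capped
        else:
            cwnd = cwnd + mss_kb                # linear phase
        cwnd = min(cwnd, receiver_window_kb)    # never exceed receiver's window
    return trace


# Threshold 32 KB, 1-KB segments, timeout assumed at transmission 13.
trace = congestion_trace(1, 32, 64, {13}, 22)
```

In this trace the window reaches the 32-KB threshold at transmission 5, climbs by 1 KB per transmission to 40 KB at transmission 13, drops to 1 KB, doubles four times to 16 KB, and is then capped at the new 20-KB threshold before growing linearly again.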


If no more timeouts occur, the congestion window will continue to grow up to the size of the 
receiver's window. At that point, it will stop growing and remain constant as long as there are 
no more timeouts and the receiver's window does not change size. As an aside, if an ICMP 
SOURCE QUENCH packet comes in and is passed to TCP, this event is treated the same way as 
a timeout. An alternative (and more recent approach) is described in RFC 3168. 


6.3.10 TCP Timer Management 


TCP uses multiple timers (at least conceptually) to do its work. The most important of these is 
the retransmission timer. When a segment is sent, a retransmission timer is started. If the 
segment is acknowledged before the timer expires, the timer is stopped. If, on the other hand, 
the timer goes off before the acknowledgement comes in, the segment is retransmitted (and 
the timer started again). The question that arises is: How long should the timeout interval be? 


This problem is much more difficult in the Internet transport layer than in the generic data link 
protocols of Chap. 3. In the latter case, the expected delay is highly predictable (i.e., has a low 
variance), so the timer can be set to go off just slightly after the acknowledgement is 
expected, as shown in Fig. 6-38(a). Since acknowledgements are rarely delayed in the data 
link layer (due to lack of congestion), the absence of an acknowledgement at the expected 
time generally means either the frame or the acknowledgement has been lost. 


Figure 6-38. (a) Probability density of acknowledgement arrival times 
in the data link layer. (b) Probability density of acknowledgement 
arrival times for TCP. 


TCP is faced with a radically different environment. The probability density function for the 
time it takes for a TCP acknowledgement to come back looks more like Fig. 6-38(b) than Fig. 
6-38(a). Determining the round-trip time to the destination is tricky. Even when it is known, 
deciding on the timeout interval is also difficult. If the timeout is set too short, say, T1 in Fig. 
6-38(b), unnecessary retransmissions will occur, clogging the Internet with useless packets. If 
it is set too long (e.g., T2), performance will suffer due to the long retransmission delay 
whenever a packet is lost. Furthermore, the mean and variance of the acknowledgement 
arrival distribution can change rapidly within a few seconds as congestion builds up or is 
resolved. 


The solution is to use a highly dynamic algorithm that constantly adjusts the timeout interval, 
based on continuous measurements of network performance. The algorithm generally used by 
TCP is due to Jacobson (1988) and works as follows. For each connection, TCP maintains a 
variable, RTT, that is the best current estimate of the round-trip time to the destination in 
question. When a segment is sent, a timer is started, both to see how long the 
acknowledgement takes and to trigger a retransmission if it takes too long. If the 
acknowledgement gets back before the timer expires, TCP measures how long the 
acknowledgement took, say, M. It then updates RTT according to the formula 


RTT = αRTT + (1 − α)M 


where α is a smoothing factor that determines how much weight is given to the old value. 
Typically α = 7/8. 


Even given a good value of RTT, choosing a suitable retransmission timeout is a nontrivial 
matter. Normally, TCP uses βRTT, but the trick is choosing β. In the initial implementations, β 
was always 2, but experience showed that a constant value was inflexible because it failed to 
respond when the variance went up. 


In 1988, Jacobson proposed making β roughly proportional to the standard deviation of the 
acknowledgement arrival time probability density function so that a large variance means a 
large β, and vice versa. In particular, he suggested using the mean deviation as a cheap 
estimator of the standard deviation. His algorithm requires keeping track of another smoothed 
variable, D, the deviation. Whenever an acknowledgement comes in, the difference between 
the expected and observed values, | RTT - M |, is computed. A smoothed value of this is 
maintained in D by the formula 


D = αD + (1 − α)|RTT − M| 


where α may or may not be the same value used to smooth RTT. While D is not exactly the 
same as the standard deviation, it is good enough and Jacobson showed how it could be 
computed using only integer adds, subtracts, and shifts—a big plus. Most TCP implementations 
now use this algorithm and set the timeout interval to 


Timeout = RTT + 4 × D 


The choice of the factor 4 is somewhat arbitrary, but it has two advantages. First, 
multiplication by 4 can be done with a single shift. Second, it minimizes unnecessary timeouts 
and retransmissions because less than 1 percent of all packets come in more than four 
standard deviations late. (Actually, Jacobson initially said to use 2, but later work has shown 
that 4 gives better performance.) 
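

The three formulas above fit together as in the following Python sketch. It uses floating-point arithmetic for readability rather than the integer adds and shifts mentioned above. The class name RttEstimator is hypothetical, and the initial values (RTT seeded with the first measurement, D seeded with half of it) are assumptions of ours, since the text does not say how the variables start out.

```python
class RttEstimator:
    """Smoothed round-trip time RTT and mean deviation D."""

    def __init__(self, first_sample, alpha=7 / 8):
        self.alpha = alpha
        self.rtt = first_sample       # assumption: seed with first measurement
        self.dev = first_sample / 2   # assumption: initial guess for D

    def update(self, m):
        """Fold in a new measurement M (same time unit as RTT)."""
        # D is updated first, using the estimate that was "expected".
        self.dev = self.alpha * self.dev + (1 - self.alpha) * abs(self.rtt - m)
        self.rtt = self.alpha * self.rtt + (1 - self.alpha) * m

    def timeout(self):
        """Timeout = RTT + 4 x D."""
        return self.rtt + 4 * self.dev


est = RttEstimator(100.0)   # times in msec
est.update(100.0)           # a sample matching the estimate shrinks D
steady = est.timeout()
est.update(200.0)           # a late acknowledgement raises both RTT and D
after_spike = est.timeout()
```

A single late acknowledgement moves RTT only slightly (from 100 to 112.5 msec here), but it also increases D, so the timeout grows by more than the change in RTT alone would suggest.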


One problem that occurs with the dynamic estimation of RTT is what to do when a segment 
times out and is sent again. When the acknowledgement comes in, it is unclear whether the 
acknowledgement refers to the first transmission or a later one. Guessing wrong can seriously 
contaminate the estimate of RTT. Phil Karn discovered this problem the hard way. He is an 
amateur radio enthusiast interested in transmitting TCP/IP packets by ham radio, a notoriously 
unreliable medium (on a good day, half the packets get through). He made a simple proposal: 
do not update RTT on any segments that have been retransmitted. Instead, the timeout is 
doubled on each failure until the segments get through the first time. This fix is called Karn's 
algorithm. Most TCP implementations use it. 
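

Karn's rule can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the class name KarnTimer, the cap max_timeout on the doubling, and the use of a plain 2 × RTT timeout between failures are all assumptions made to keep the example short.

```python
class KarnTimer:
    """Retransmission timeout with Karn's rule applied."""

    def __init__(self, initial_rtt, alpha=7 / 8, max_timeout=64.0):
        self.alpha = alpha
        self.rtt = initial_rtt
        self.rto = 2 * initial_rtt        # assumption: simple starting timeout
        self.max_timeout = max_timeout    # assumption: upper bound on backoff

    def on_timeout(self):
        """A segment timed out: double the timeout, leave RTT alone."""
        self.rto = min(2 * self.rto, self.max_timeout)

    def on_ack(self, measured, was_retransmitted):
        """An acknowledgement arrived after `measured` time units."""
        if was_retransmitted:
            return                        # ambiguous sample: do not update RTT
        self.rtt = self.alpha * self.rtt + (1 - self.alpha) * measured
        self.rto = 2 * self.rtt           # a clean sample ends the backoff
```

The important point is the early return in on_ack: an acknowledgement for a retransmitted segment never touches the estimate, so a wrong guess about which transmission it belongs to cannot contaminate RTT.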


The retransmission timer is not the only timer TCP uses. A second timer is the persistence 
timer. It is designed to prevent the following deadlock. The receiver sends an 
acknowledgement with a window size of 0, telling the sender to wait. Later, the receiver 
updates the window, but the packet with the update is lost. Now both the sender and the 
receiver are waiting for each other to do something. When the persistence timer goes off, the 
sender transmits a probe to the receiver. The response to the probe gives the window size. If 
it is still zero, the persistence timer is set again and the cycle repeats. If it is nonzero, data can 
now be sent. 


A third timer that some implementations use is the keepalive timer. When a connection has 
been idle for a long time, the keepalive timer may go off to cause one side to check whether 
the other side is still there. If it fails to respond, the connection is terminated. This feature is 
controversial because it adds overhead and may terminate an otherwise healthy connection 
due to a transient network partition. 


The last timer used on each TCP connection is the one used in the TIMED WAIT state while 
closing. It runs for twice the maximum packet lifetime to make sure that when a connection is 
closed, all packets created by it have died off. 


6.3.11 Wireless TCP and UDP 


In theory, transport protocols should be independent of the technology of the underlying 
network layer. In particular, TCP should not care whether IP is running over fiber or over radio. 
In practice, it does matter because most TCP implementations have been carefully optimized 
based on assumptions that are true for wired networks but that fail for wireless networks. 
Ignoring the properties of wireless transmission can lead to a TCP implementation that is 
logically correct but has horrendous performance. 


The principal problem is the congestion control algorithm. Nearly all TCP implementations 
nowadays assume that timeouts are caused by congestion, not by packets lost to transmission errors. Consequently, 
when a timer goes off, TCP slows down and sends less vigorously (e.g., Jacobson's slow start 
algorithm). The idea behind this approach is to reduce the network load and thus alleviate the 
congestion. 


Unfortunately, wireless transmission links are highly unreliable. They lose packets all the time. 
The proper approach to dealing with lost packets is to send them again, and as quickly as 
possible. Slowing down just makes matters worse. If, say, 20 percent of all packets are lost, 
then when the sender transmits 100 packets/sec, the throughput is 80 packets/sec. If the 
sender slows down to 50 packets/sec, the throughput drops to 40 packets/sec. 


In effect, when a packet is lost on a wired network, the sender should slow down. When one is 
lost on a wireless network, the sender should try harder. When the sender does not know what 
the network is, it is difficult to make the correct decision. 


Frequently, the path from sender to receiver is heterogeneous. The first 1000 km might be 
over a wired network, but the last 1 km might be wireless. Now making the correct decision on 
a timeout is even harder, since it matters where the problem occurred. A solution proposed by 
Bakre and Badrinath (1995), indirect TCP, is to split the TCP connection into two separate 
connections, as shown in Fig. 6-39. The first connection goes from the sender to the base 
station. The second one goes from the base station to the receiver. The base station simply 
copies packets between the connections in both directions. 


Figure 6-39. Splitting a TCP connection into two connections. 



The advantage of this scheme is that both connections are now homogeneous. Timeouts on 
the first connection can slow the sender down, whereas timeouts on the second one can speed 
it up. Other parameters can also be tuned separately for the two connections. The 
disadvantage of the scheme is that it violates the semantics of TCP. Since each part of the 
connection is a full TCP connection, the base station acknowledges each TCP segment in the 
usual way. Only now, receipt of an acknowledgement by the sender does not mean that the 
receiver got the segment, only that the base station got it. 


A different solution, due to Balakrishnan et al. (1995), does not break the semantics of TCP. It 
works by making several small modifications to the network layer code in the base station. 
One of the changes is the addition of a snooping agent that observes and caches TCP 
segments going out to the mobile host and acknowledgements coming back from it. When the 
snooping agent sees a TCP segment going out to the mobile host but does not see an 
acknowledgement coming back before its (relatively short) timer goes off, it just retransmits 
that segment, without telling the source that it is doing so. It also retransmits when it sees 
duplicate acknowledgements from the mobile host go by, invariably meaning that the mobile 
host has missed something. Duplicate acknowledgements are discarded on the spot, to avoid 
having the source misinterpret them as congestion. 
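

The behavior of the snooping agent can be modeled roughly as follows. This Python sketch is a toy under several assumptions of ours: segments are identified by a sequence number, an acknowledgement carries the number of the next segment expected, timers are driven from outside through on_local_timeout, and the names SnoopAgent and send_to_mobile are hypothetical. It is not the code of Balakrishnan et al.

```python
class SnoopAgent:
    """Toy model of a base-station agent that hides wireless losses."""

    def __init__(self, send_to_mobile):
        self.send_to_mobile = send_to_mobile  # callback: transmit on the wireless link
        self.cache = {}        # seq -> segment not yet acknowledged by the mobile host
        self.last_ack = None   # most recent acknowledgement seen

    def on_segment_from_source(self, seq, segment):
        self.cache[seq] = segment             # remember it in case it is lost
        self.send_to_mobile(seq, segment)

    def on_local_timeout(self, seq):
        if seq in self.cache:                 # still unacknowledged: resend quietly
            self.send_to_mobile(seq, self.cache[seq])

    def on_ack_from_mobile(self, ack):
        """Return True if the acknowledgement should be forwarded to the source."""
        if ack == self.last_ack:              # duplicate: segment `ack` is missing
            if ack in self.cache:
                self.send_to_mobile(ack, self.cache[ack])
            return False                      # discard the duplicate on the spot
        self.last_ack = ack
        for seq in [s for s in self.cache if s < ack]:
            del self.cache[seq]               # everything below `ack` has arrived
        return True
```

Note that the source never learns about either kind of local retransmission; it sees only the acknowledgements for which on_ack_from_mobile returns True.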


One disadvantage of this transparency, however, is that if the wireless link is very lossy, the 
source may time out waiting for an acknowledgement and invoke the congestion control 
algorithm. With indirect TCP, the congestion control algorithm will never be started unless 
there really is congestion in the wired part of the network. 


The Balakrishnan et al. paper also has a solution to the problem of lost segments originating at 
the mobile host. When the base station notices a gap in the inbound sequence numbers, it 
generates a request for a selective repeat of the missing bytes by using a TCP option. 


Using these fixes, the wireless link is made more reliable in both directions, without the source 
knowing about it and without changing the TCP semantics. 


While UDP does not suffer from the same problems as TCP, wireless communication also 
introduces difficulties for it. The main trouble is that programs use UDP expecting it to be 
highly reliable. They know that no guarantees are given, but they still expect it to be near 
perfect. In a wireless environment, UDP will be far from perfect. For programs that can recover 
from lost UDP messages but only at considerable cost, suddenly going from an environment 
where messages theoretically can be lost but rarely are, to one in which they are constantly 
being lost can result in a performance disaster. 


Wireless communication also affects areas other than just performance. For example, how 
does a mobile host find a local printer to connect to, rather than use its home printer? 
Somewhat related to this is how to get the WWW page for the local cell, even if its name is not 
known. Also, WWW page designers tend to assume lots of bandwidth is available. Putting a 
large logo on every page becomes counterproductive if it is going to take 10 sec to transmit 
over a slow wireless link every time the page is referenced, irritating the users no end. 


As wireless networking becomes more common, the problems of running TCP over it become 
more acute. Additional work in this area is reported in (Barakat et al., 2000; Ghani and Dixit, 
1999; Huston, 2001; and Xylomenos et al., 2001). 


6.3.12 Transactional TCP 


Earlier in this chapter we looked at remote procedure call as a way to implement client-server 
systems. If both the request and reply are small enough to fit into single packets and the 
operation is idempotent, UDP can simply be used. However, if these conditions are not met, 
using UDP is less attractive. For example, if the reply can be quite large, then the pieces must 
be sequenced and a mechanism must be devised to retransmit lost pieces. In effect, the 
application is required to reinvent TCP. 


Clearly, that is unattractive, but using TCP itself is also unattractive. The problem is the 
efficiency. The normal sequence of packets for doing an RPC over TCP is shown in Fig. 6-40(a). 
Nine packets are required in the best case. 


Figure 6-40. (a) RPC using normal TCP. (b) RPC using T/TCP. 


The nine packets are as follows: 


1. The client sends a SYN packet to establish a connection. 
2. The server sends an ACK packet to acknowledge the SYN packet. 
3. The client completes the three-way handshake. 
4. The client sends the actual request. 
5. The client sends a FIN packet to indicate that it is done sending. 
6. The server acknowledges the request and the FIN. 
7. The server sends the reply back to the client. 
8. The server sends a FIN packet to indicate that it is also done. 
9. The client acknowledges the server's FIN. 


Note that this is the best case. In the worst case, the client's request and FIN are 
acknowledged separately, as are the server's reply and FIN. 


The question quickly arises of whether there is some way to combine the efficiency of RPC 
using UDP (just two messages) with the reliability of TCP. The answer is: Almost. It can be 
done with an experimental TCP variant called T/TCP (Transactional TCP), which is described 
in RFCs 1379 and 1644. 


The central idea here is to modify the standard connection setup sequence slightly to allow the 
transfer of data during setup. The T/TCP protocol is illustrated in Fig. 6-40(b). The client's first 
packet contains the SYN bit, the request itself, and the FIN. In effect it says: I want to 
establish a connection, here is the data, and I am done. 


When the server gets the request, it looks up or computes the reply, and chooses how to 
respond. If the reply fits in one packet, it gives the reply of Fig. 6-40(b), which says: I 
acknowledge your FIN, here is the answer, and I am done. The client then acknowledges the 
server's FIN and the protocol terminates in three messages. 


However, if the result is larger than 1 packet, the server also has the option of not turning on 
the FIN bit, in which case it can send multiple packets before closing its direction. 


It is probably worth mentioning that T/TCP is not the only proposed improvement to TCP. 
Another proposal is SCTP (Stream Control Transmission Protocol). Its features include 
message boundary preservation, multiple delivery modes (e.g., unordered delivery), 
multihoming (backup destinations), and selective acknowledgements (Stewart and Metz, 
2001). However, whenever someone proposes changing something that has worked so well for 
so long, there is always a huge battle between the "Users are demanding more features" and 
"If it ain't broken, don't fix it" camps. 


6.4 Performance Issues 


Performance issues are very important in computer networks. When hundreds or thousands of 
computers are interconnected, complex interactions, with unforeseen consequences, are 
common. Frequently, this complexity leads to poor performance and no one knows why. In the 
following sections, we will examine many issues related to network performance to see what 
kinds of problems exist and what can be done about them. 


Unfortunately, understanding network performance is more an art than a science. There is 
little underlying theory that is actually of any use in practice. The best we can do is give rules 
of thumb gained from hard experience and present examples taken from the real world. We 
have intentionally delayed this discussion until we studied the transport layer in TCP in order to 
be able to use TCP as an example in various places. 


The transport layer is not the only place performance issues arise. We saw some of them in 
the network layer in the previous chapter. Nevertheless, the network layer tends to be largely 
concerned with routing and congestion control. The broader, system-oriented issues tend to be 
transport related, so this chapter is an appropriate place to examine them. 


In the next five sections, we will look at five aspects of network performance: 


1. Performance problems. 
2. Measuring network performance. 
3. System design for better performance. 
4. Fast TPDU processing. 
5. Protocols for future high-performance networks. 


As an aside, we need a generic name for the units exchanged by transport entities. The TCP 
term, segment, is confusing at best and is never used outside the TCP world in this context. 
The ATM terms (CS-PDU, SAR-PDU, and CPCS-PDU) are specific to ATM. Packets clearly refer 
to the network layer, and messages belong to the application layer. For lack of a standard 
term, we will go back to calling the units exchanged by transport entities TPDUs. When we 
mean both TPDU and packet together, we will use packet as the collective term, as in "The 
CPU must be fast enough to process incoming packets in real time." By this we mean both the 
network layer packet and the TPDU encapsulated in it. 


6.4.1 Performance Problems in Computer Networks 


Some performance problems, such as congestion, are caused by temporary resource 
overloads. If more traffic suddenly arrives at a router than the router can handle, congestion 
will build up and performance will suffer. We studied congestion in detail in the previous 
chapter. 


Performance also degrades when there is a structural resource imbalance. For example, if a 
gigabit communication line is attached to a low-end PC, the poor CPU will not be able to 
process the incoming packets fast enough and some will be lost. These packets will eventually 
be retransmitted, adding delay, wasting bandwidth, and generally reducing performance. 


Overloads can also be synchronously triggered. For example, if a TPDU contains a bad 
parameter (e.g., the port for which it is destined), in many cases the receiver will thoughtfully 
send back an error notification. Now consider what could happen if a bad TPDU is broadcast to 
10,000 machines: each one might send back an error message. The resulting broadcast 
storm could cripple the network. UDP suffered from this problem until the protocol was 
changed to cause hosts to refrain from responding to errors in UDP TPDUs sent to broadcast 
addresses. 


A second example of synchronous overload is what happens after an electrical power failure. 
When the power comes back on, all the machines simultaneously jump to their ROMs to start 
rebooting. A typical reboot sequence might require first going to some (DHCP) server to learn 
one's true identity, and then to some file server to get a copy of the operating system. If 
hundreds of machines all do this at once, the server will probably collapse under the load. 


Even in the absence of synchronous overloads and the presence of sufficient resources, poor 
performance can occur due to lack of system tuning. For example, if a machine has plenty of 
CPU power and memory but not enough of the memory has been allocated for buffer space, 
overruns will occur and TPDUs will be lost. Similarly, if the scheduling algorithm does not give 
a high enough priority to processing incoming TPDUs, some of them may be lost. 


Another tuning issue is setting timeouts correctly. When a TPDU is sent, a timer is typically set 
to guard against loss of the TPDU. If the timeout is set too short, unnecessary retransmissions 
will occur, clogging the wires. If the timeout is set too long, unnecessary delays will occur after 
a TPDU is lost. Other tunable parameters include how long to wait for data on which to 
piggyback before sending a separate acknowledgement, and how many retransmissions before 
giving up. 


Gigabit networks bring with them new performance problems. Consider, for example, sending 
a 64-KB burst of data from San Diego to Boston in order to fill the receiver's 64-KB buffer. 
Suppose that the link is 1 Gbps and the one-way speed-of-light-in-fiber delay is 20 msec. 
Initially, at t = 0, the pipe is empty, as illustrated in Fig. 6-41(a). Only 500 μsec later, in Fig. 
6-41(b), all the TPDUs are out on the fiber. The lead TPDU will now be somewhere in the 
vicinity of Brawley, still deep in Southern California. However, the transmitter must stop until it 
gets a window update. 


Figure 6-41. The state of transmitting one megabit from San Diego to 
Boston. (a) At t = 0. (b) After 500 μsec. (c) After 20 msec. (d) After 40 msec. 


After 20 msec, the lead TPDU hits Boston, as shown in Fig. 6-41(c), and is acknowledged. 
Finally, 40 msec after starting, the first acknowledgement gets back to the sender and the 
second burst can be transmitted. Since the transmission line was used for 0.5 msec out of 40, 
the efficiency is about 1.25 percent. This situation is typical of older protocols running over 
gigabit lines. 


A useful quantity to keep in mind when analyzing network performance is the bandwidth- 
delay product. It is obtained by multiplying the bandwidth (in bits/sec) by the round-trip 
delay time (in sec). The product is the capacity of the pipe from the sender to the receiver and 
back (in bits). 


For the example of Fig. 6-41 the bandwidth-delay product is 40 million bits. In other words, 
the sender would have to transmit a burst of 40 million bits to be able to keep going full speed 
until the first acknowledgement came back. It takes this many bits to fill the pipe (in both 
directions). This is why a burst of half a million bits only achieves a 1.25 percent efficiency: it 
is only 1.25 percent of the pipe's capacity. 


The conclusion that can be drawn here is that for good performance, the receiver's window 
must be at least as large as the bandwidth-delay product, preferably somewhat larger since 
the receiver may not respond instantly. For a transcontinental gigabit line, at least 5 
megabytes are required. 
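

The arithmetic behind these figures is easy to check. The following Python sketch recomputes the bandwidth-delay product, the minimum window, and the efficiency of the example; the function names are our own, and the burst is taken as the rounded "half a million bits" used in the text rather than the exact size of 64 KB.

```python
def bandwidth_delay_product(bandwidth_bps, round_trip_sec):
    """Capacity of the pipe, there and back, in bits."""
    return bandwidth_bps * round_trip_sec


def line_efficiency(burst_bits, bandwidth_bps, round_trip_sec):
    """Fraction of the time the line is busy if one burst is sent per round trip."""
    transmit_time = burst_bits / bandwidth_bps
    return transmit_time / round_trip_sec


bdp_bits = bandwidth_delay_product(1e9, 0.040)      # 1 Gbps, 40-msec round trip
min_window_bytes = bdp_bits / 8                     # window needed to fill the pipe
efficiency = line_efficiency(500_000, 1e9, 0.040)   # half a million bits per burst
```

Under these assumptions bdp_bits comes to 40 million bits, min_window_bytes to 5 million bytes, and efficiency to 0.0125, that is, 1.25 percent.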


If the efficiency is terrible for sending a megabit, imagine what it is like for a short request of a 
few hundred bytes. Unless some other use can be found for the line while the first client is 
waiting for its reply, a gigabit line is no better than a megabit line, just more expensive. 


Another performance problem that occurs with time-critical applications like audio and video is 
jitter. Having a short mean transmission time is not enough. A small standard deviation is also 
required. Achieving a short mean transmission time along with a small standard deviation 
demands a serious engineering effort. 


6.4.2 Network Performance Measurement 


When a network performs poorly, its users often complain to the folks running it, demanding 
improvements. To improve the performance, the operators must first determine exactly what 
is going on. To find out what is really happening, the operators must make measurements. In 
this section we will look at network performance measurements. The discussion below is based 
on the work of Mogul (1993). 


The basic loop used to improve network performance contains the following steps: 


1. Measure the relevant network parameters and performance. 
2. Try to understand what is going on. 
3. Change one parameter. 


These steps are repeated until the performance is good enough or it is clear that the last drop 
of improvement has been squeezed out. 


Measurements can be made in many ways and at many locations (both physically and in the 
protocol stack). The most basic kind of measurement is to start a timer when beginning some 
activity and see how long that activity takes. For example, knowing how long it takes for a 
TPDU to be acknowledged is a key measurement. Other measurements are made with 
counters that record how often some event has happened (e.g., number of lost TPDUs). 
Finally, one is often interested in knowing the amount of something, such as the number of 
bytes processed in a certain time interval. 


Measuring network performance and parameters has many potential pitfalls. Below we list a 
few of them. Any systematic attempt to measure network performance should be careful to 
avoid these. 


Make Sure That the Sample Size Is Large Enough 


Do not measure the time to send one TPDU, but repeat the measurement, say, one million 
times and take the average. Having a large sample will reduce the uncertainty in the measured 
mean and standard deviation. This uncertainty can be computed using standard statistical 
formulas. 
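

One such formula is the standard error of the mean, which is the sample standard deviation divided by the square root of the number of samples. The Python sketch below applies it; the function name mean_with_uncertainty is hypothetical, and the formula assumes the measurements are independent, which real network measurements may not be.

```python
import math
import statistics


def mean_with_uncertainty(samples):
    """Return (mean, standard error of the mean) for a list of measurements."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)     # sample standard deviation
    return mean, stdev / math.sqrt(n)


small = [1.0, 2.0, 3.0, 4.0, 5.0]         # made-up timings, for illustration only
mean_small, err_small = mean_with_uncertainty(small)
mean_big, err_big = mean_with_uncertainty(small * 100)   # 100 times more samples
```

The two means are the same, but the uncertainty of the larger sample is roughly one tenth that of the smaller one: the error shrinks with the square root of the sample size.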


Make Sure That the Samples Are Representative 


Ideally, the whole sequence of one million measurements should be repeated at different times 
of the day and the week to see the effect of different system loads on the measured quantity. 
Measurements of congestion, for example, are of little use if they are made at a moment when 
there is no congestion. Sometimes the results may be counterintuitive at first, such as heavy 
congestion at 10, 11, 1, and 2 o'clock, but no congestion at noon (when all the users are away 
at lunch). 


Be Careful When Using a Coarse-Grained Clock 


Computer clocks work by incrementing some counter at regular intervals. For example, a 
millisecond timer adds 1 to a counter every 1 msec. Using such a timer to measure an event 
that takes less than 1 msec is possible, but requires some care. (Some computers have more 
accurate clocks, of course.) 


To measure the time to send a TPDU, for example, the system clock (say, in milliseconds) 
should be read out when the transport layer code is entered and again when it is exited. If the 
true TPDU send time is 300 μsec, the difference between the two readings will be either 0 or 1, 
both wrong. However, if the measurement is repeated one million times and the total of all 
measurements added up and divided by one million, the mean time will be accurate to better 
than 1 μsec. 
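

This effect is easy to demonstrate by simulation. The Python sketch below models a clock that shows only whole milliseconds and an event that starts at a random point within a tick; the function name coarse_clock_mean, the uniformly random starting phase, and the fixed random seed are assumptions made for the illustration.

```python
import math
import random


def coarse_clock_mean(true_duration_ms, trials, tick_ms=1.0, seed=0):
    """Average of `trials` readings taken with a clock that ticks every tick_ms."""
    rng = random.Random(seed)
    total_ticks = 0
    for _ in range(trials):
        start = rng.uniform(0.0, tick_ms)      # random phase within a tick
        end = start + true_duration_ms
        # The clock shows whole ticks only, so a single reading of a
        # sub-tick event is either 0 or 1.
        total_ticks += math.floor(end / tick_ms) - math.floor(start / tick_ms)
    return total_ticks * tick_ms / trials


estimate_ms = coarse_clock_mean(0.3, 200_000)
```

Each individual reading is 0 or 1 msec, yet the average converges on the true 0.3 msec, because a reading of 1 occurs in about 30 percent of the trials.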


Be Sure That Nothing Unexpected Is Going On during Your Tests 


Making measurements on a university system the day some major lab project has to be turned 
in may give different results than if made the next day. Likewise, if some researcher has 
decided to run a video conference over your network during your tests, you may get a biased 
result. It is best to run tests on an idle system and create the entire workload yourself. Even 
this approach has pitfalls though. While you might think nobody will be using the network at 3 
A.M., that might be precisely when the automatic backup program begins copying all the disks 
to tape. Furthermore, there might be heavy traffic for your wonderful World Wide Web pages 
from distant time zones. 


Caching Can Wreak Havoc with Measurements 


The obvious way to measure file transfer times is to open a large file, read the whole thing, 
close it, and see how long it takes. Then repeat the measurement many more times to get a 
good average. The trouble is, the system may cache the file, so only the first measurement 
actually involves network traffic. The rest are just reads from the local cache. The results from 
such a measurement are essentially worthless (unless you want to measure cache 
performance). 


Often you can get around caching by simply overflowing the cache. For example, if the cache is 
10 MB, the test loop could open, read, and close two 10-MB files on each pass, in an attempt 
to force the cache hit rate to 0. Still, caution is advised unless you are absolutely sure you 
understand the caching algorithm. 


Buffering can have a similar effect. One popular TCP/IP performance utility program has been 
known to report that UDP can achieve a performance substantially higher than the physical line 
allows. How does this occur? A call to UDP normally returns control as soon as the message 
has been accepted by the kernel and added to the transmission queue. If there is sufficient 
buffer space, timing 1000 UDP calls does not mean that all the data have been sent. Most of 
them may still be in the kernel, but the performance utility thinks they have all been 
transmitted. 


Understand What You Are Measuring 


When you measure the time to read a remote file, your measurements depend on the network, 
the operating systems on both the client and server, the particular hardware interface boards 
used, their drivers, and other factors. If the measurements are done carefully, you will 
ultimately discover the file transfer time for the configuration you are using. If your goal is to 
tune this particular configuration, these measurements are fine. 


However, if you are making similar measurements on three different systems in order to 
choose which network interface board to buy, your results could be thrown off completely by 
the fact that one of the network drivers is truly awful and is only getting 10 percent of the 
performance of the board. 


Be Careful about Extrapolating the Results 


Suppose that you make measurements of something with simulated network loads running 
from 0 (idle) to 0.4 (40 percent of capacity), as shown by the data points and solid line 
through them in Fig. 6-42. It may be tempting to extrapolate linearly, as shown by the dotted 
line. However, many queueing results involve a factor of 1/(1 - p), where p is the load, so the 
true values may look more like the dashed line, which rises much faster than linearly. 
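
To get a feel for how far off a linear extrapolation can be, consider the sketch below. It assumes, purely for illustration, that the response is exactly the idle response multiplied by 1/(1 - p); real systems will differ in detail. The function names are ours.

```python
def queueing_response(load, idle_response=1.0):
    """Response time scaled by the 1/(1 - p) factor from the text."""
    if not 0 <= load < 1:
        raise ValueError("load must be in [0, 1)")
    return idle_response / (1.0 - load)

def linear_extrapolation(load, x0=0.0, x1=0.4):
    """Straight line through the measured points at loads x0 and x1."""
    y0, y1 = queueing_response(x0), queueing_response(x1)
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (load - x0)
```

At a load of 0.9, the straight line predicts a response 2.5 times the idle value, whereas the 1/(1 - p) model gives 10 times the idle value.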


Figure 6-42. Response as a function of load. 


6.4.3 System Design for Better Performance 


Measuring and tinkering can often improve performance considerably, but they cannot 
substitute for good design in the first place. A poorly-designed network can be improved only 
so much. Beyond that, it has to be redesigned from scratch. 


In this section, we will present some rules of thumb based on hard experience with many 
networks. These rules relate to system design, not just network design, since the software and 
operating system are often more important than the routers and interface boards. Most of 
these ideas have been common knowledge to network designers for years and have been 
passed on from generation to generation by word of mouth. They were first stated explicitly by 
Mogul (1993); our treatment largely follows his. Another relevant source is (Metcalfe, 1993). 


Rule #1: CPU Speed Is More Important Than Network Speed 


Long experience has shown that in nearly all networks, operating system and protocol 
overhead dominate actual time on the wire. For example, in theory, the minimum RPC time on 
an Ethernet is 102 usec, corresponding to a minimum (64-byte) request followed by a 
minimum (64-byte) reply. In practice, overcoming the software overhead and getting the RPC 
time anywhere near there is a substantial achievement. 
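
The 102-usec figure can be reproduced with a line of arithmetic. The sketch below assumes classic 10-Mbps Ethernet, which is the rate that yields this number, and counts only the time the bits spend on the wire. The function name is ours.

```python
def wire_time_usec(frame_bytes, rate_bps):
    """Time to clock one frame onto the wire, in microseconds."""
    return frame_bytes * 8 / rate_bps * 1e6

# Minimum request plus minimum reply on 10-Mbps Ethernet (assumed rate).
min_rpc_usec = 2 * wire_time_usec(64, 10e6)
```

Each 64-byte frame takes 51.2 usec, so the pair takes 102.4 usec before any software runs at all.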


Similarly, the biggest problem in running at 1 Gbps is getting the bits from the user's buffer 
out onto the fiber fast enough and having the receiving CPU process them as fast as they come 
in. In short, if you double the CPU speed, you often can come close to doubling the 
throughput. Doubling the network capacity often has no effect since the bottleneck is generally 
in the hosts. 


Rule #2: Reduce Packet Count to Reduce Software Overhead 


Processing a TPDU has a certain amount of overhead per TPDU (e.g., header processing) and a 
certain amount of processing per byte (e.g., doing the checksum). When 1 million bytes are 
being sent, the per-byte overhead is the same no matter what the TPDU size is. However, 
using 128-byte TPDUs means 32 times as much per-TPDU overhead as using 4-KB TPDUs. This 
overhead adds up fast. 
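
The factor of 32 can be checked with a simple cost model. In this sketch the two cost parameters are hypothetical units of our own choosing; only the TPDU counts follow from the text.

```python
import math

def tpdu_count(total_bytes, tpdu_size):
    """Number of TPDUs needed to carry total_bytes of data."""
    return math.ceil(total_bytes / tpdu_size)

def processing_cost(total_bytes, tpdu_size, per_tpdu_cost, per_byte_cost):
    """Total cost: a fixed part per TPDU plus a part per byte."""
    return (tpdu_count(total_bytes, tpdu_size) * per_tpdu_cost
            + total_bytes * per_byte_cost)
```

Sending 1 million bytes takes 7813 TPDUs of 128 bytes but only 245 TPDUs of 4 KB, while the per-byte part of the cost is identical in both cases.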


In addition to the TPDU overhead, there is overhead in the lower layers to consider. Each 
arriving packet causes an interrupt. On a modern pipelined processor, each interrupt breaks 
the CPU pipeline, interferes with the cache, requires a change to the memory management 
context, and forces a substantial number of CPU registers to be saved. An n-fold reduction in 
TPDUs sent thus reduces the interrupt and packet overhead by a factor of n. 


This observation argues for collecting a substantial amount of data before transmission in 
order to reduce interrupts at the other side. Nagle's algorithm and Clark's solution to the silly 
window syndrome are attempts to do precisely this. 


Rule #3: Minimize Context Switches 


Context switches (e.g., from kernel mode to user mode) are deadly. They have the same bad 
properties as interrupts, the worst being a long series of initial cache misses. Context switches 
can be reduced by having the library procedure that sends data do internal buffering until it 
has a substantial amount of them. Similarly, on the receiving side, small incoming TPDUs 
should be collected together and passed to the user in one fell swoop instead of individually, to 
minimize context switches. 
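
A minimal sketch of sender-side buffering in a library procedure follows. The class name, the threshold, and the send_fn callback (which stands in for the expensive trap into the kernel) are all assumptions made for illustration, not part of any real API.

```python
class BufferedSender:
    """Collects small writes and hands them on in large chunks."""

    def __init__(self, send_fn, threshold=4096):
        self._send_fn = send_fn      # stands in for the costly kernel call
        self._threshold = threshold  # flush once this many bytes are pending
        self._pending = bytearray()

    def write(self, data):
        """Buffer data; cross into the 'kernel' only when enough has piled up."""
        self._pending += data
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Push out whatever is pending, if anything."""
        if self._pending:
            self._send_fn(bytes(self._pending))
            self._pending.clear()
```

With a threshold of 100 bytes, 25 writes of 10 bytes each result in only three calls to send_fn (including a final explicit flush) instead of 25. The price is extra delay for the buffered data, so interactive traffic may need an explicit flush.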


In the best case, an incoming packet causes a context switch from the current user to the 
kernel, and then a switch to the receiving process to give it the newly-arrived data. 
Unfortunately, with many operating systems, additional context switches happen. For example, 
if the network manager runs as a special process in user space, a packet arrival is likely to 
cause a context switch from the current user to the kernel, then another one from the kernel 
to the network manager, followed by another one back to the kernel, and finally one from the 
kernel to the receiving process. This sequence is shown in Fig. 6-43. All these context switches 
on each packet are very wasteful of CPU time and will have a devastating effect on network 
performance. 


Figure 6-43. Four context switches to handle one packet with a user-space network manager. 


Rule #4: Minimize Copying 


Even worse than multiple context switches are multiple copies. It is not unusual for an 
incoming packet to be copied three or four times before the TPDU enclosed in it is delivered. 
After a packet is received by the network interface in a special on-board hardware buffer, it is 
typically copied to a kernel buffer. From there it is copied to a network layer buffer, then to a 
transport layer buffer, and finally to the receiving application process. 


A clever operating system will copy a word at a time, but it is not unusual to require about five 
instructions per word (a load, a store, incrementing an index register, a test for end-of-data, 
and a conditional branch). Making three copies of each packet at five instructions per 32-bit 
word copied requires 15/4 or about four instructions per byte copied. On a 500-MIPS CPU, an 
instruction takes 2 nsec so each byte needs 8 nsec of processing time or about 1 nsec per bit, 
giving a maximum rate of about 1 Gbps. When overhead for header processing, interrupt 
handling, and context switches is factored in, 500 Mbps might be achievable, and we have not 
even considered the actual processing of the data. Clearly, handling a 10-Gbps Ethernet 
running at full blast is out of the question. 


In fact, probably a 500-Mbps line cannot be handled at full speed either. In the computation 
above, we have assumed that a 500-MIPS machine can execute any 500 million 
instructions/sec. In reality, machines can only run at such speeds if they are not referencing 
memory. Memory operations are often a factor of ten slower than register-register instructions 
(i.e., 20 nsec/instruction). If 20 percent of the instructions actually reference memory (i.e., 
are cache misses), which is likely when touching incoming packets, the average instruction 
execution time is 5.6 nsec (0.8 x 2 + 0.2 x 20). With four instructions/byte, we need 22.4 
nsec/byte, or 2.8 nsec/bit, which gives about 357 Mbps. Factoring in 50 percent overhead 
gives us 178 Mbps. Note that hardware assistance will not help here. The problem is too much 
copying by the operating system. 
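
The arithmetic in the last two paragraphs can be replayed in a few lines. The sketch below uses the text's figures (four instructions per byte, 2 nsec per register instruction, 20 nsec per memory reference, 20 percent memory references); the function names are ours.

```python
def average_instruction_nsec(register_nsec, memory_nsec, memory_fraction):
    """Weighted average instruction time when some instructions hit memory."""
    return (1 - memory_fraction) * register_nsec + memory_fraction * memory_nsec

def copy_limited_rate_mbps(instructions_per_byte, nsec_per_instruction):
    """Highest data rate the CPU can sustain if it does nothing but copy."""
    nsec_per_bit = instructions_per_byte * nsec_per_instruction / 8
    return 1000.0 / nsec_per_bit   # 1 bit per nsec equals 1000 Mbps

ideal = copy_limited_rate_mbps(4, 2)                       # about 1 Gbps
realistic_nsec = average_instruction_nsec(2, 20, 0.2)      # 5.6 nsec
realistic = copy_limited_rate_mbps(4, realistic_nsec)      # about 357 Mbps
```

Halving the realistic figure to allow for 50 percent overhead gives the 178 Mbps quoted above.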


Rule #5: You Can Buy More Bandwidth but Not Lower Delay 


The next three rules deal with communication, rather than protocol processing. The first rule 
states that if you want more bandwidth, you can just buy it. Putting a second fiber next to the 
first one doubles the bandwidth but does nothing to reduce the delay. Making the delay shorter 
requires improving the protocol software, the operating system, or the network interface. Even 
if all of these improvements are made, the delay will not be reduced if the bottleneck is the 
transmission time. 


Rule #6: Avoiding Congestion Is Better Than Recovering from It 


The old maxim that an ounce of prevention is worth a pound of cure certainly holds for 
network congestion. When a network is congested, packets are lost, bandwidth is wasted, 
useless delays are introduced, and more. Recovering from congestion takes time and patience. 
Not having it occur in the first place is better. Congestion avoidance is like getting your DTP 
vaccination: it hurts a little at the time you get it, but it prevents something that would hurt a 
lot more in the future. 


Rule #7: Avoid Timeouts 


Timers are necessary in networks, but they should be used sparingly and timeouts should be 
minimized. When a timer goes off, some action is generally repeated. If it is truly necessary to 
repeat the action, so be it, but repeating it unnecessarily is wasteful. 


The way to avoid extra work is to be careful that timers are set a little bit on the conservative 
side. A timer that takes too long to expire adds a small amount of extra delay to one 
connection in the (unlikely) event of a TPDU being lost. A timer that goes off when it should 
not have uses up scarce CPU time, wastes bandwidth, and puts extra load on perhaps dozens 
of routers for no good reason. 


6.4.4 Fast TPDU Processing 


The moral of the story above is that the main obstacle to fast networking is protocol software. 
In this section we will look at some ways to speed up this software. For more information, see 
(Clark et al., 1989; and Chase et al., 2001). 


TPDU processing overhead has two components: overhead per TPDU and overhead per byte. 
Both must be attacked. The key to fast TPDU processing is to separate out the normal case 
(one-way data transfer) and handle it specially. Although a sequence of special TPDUs is 
needed to get into the ESTABLISHED state, once there, TPDU processing is straightforward 
until one side starts to close the connection. 


Let us begin by examining the sending side in the ESTABLISHED state when there are data to 
be transmitted. For the sake of clarity, we assume here that the transport entity is in the 
kernel, although the same ideas apply if it is a user-space process or a library inside the 
sending process. In Fig. 6-44, the sending process traps into the kernel to do the SEND. The 
first thing the transport entity does is test to see if this is the normal case: the state is 
ESTABLISHED, neither side is trying to close the connection, a regular (i.e., not an 
out-of-band) full TPDU is being sent, and enough window space is available at the receiver. If all 
conditions are met, no further tests are needed and the fast path through the sending 
transport entity can be taken. Typically, this path is taken most of the time. 


Figure 6-44. The fast path from sender to receiver is shown with a 
heavy line. The processing steps on this path are shaded. 


In the usual case, the headers of consecutive data TPDUs are almost the same. To take 
advantage of this fact, a prototype header is stored within the transport entity. At the start of 
the fast path, it is copied as fast as possible to a scratch buffer, word by word. Those fields 
that change from TPDU to TPDU are then overwritten in the buffer. Frequently, these fields are 
easily derived from state variables, such as the next sequence number. A pointer to the full 
TPDU header plus a pointer to the user data are then passed to the network layer. Here the 
same strategy can be followed (not shown in Fig. 6-44). Finally, the network layer gives the 
resulting packet to the data link layer for transmission. 


As an example of how this principle works in practice, let us consider TCP/IP. Fig. 6-45(a) 
shows the TCP header. The fields that are the same between consecutive TPDUs on a one-way 
flow are shaded. All the sending transport entity has to do is copy the five words from the 
prototype header into the output buffer, fill in the next sequence number (by copying it from a 
word in memory), compute the checksum, and increment the sequence number in memory. It 
can then hand the header and data to a special IP procedure for sending a regular, maximum 
TPDU. IP then copies its five-word prototype header [see Fig. 6-45(b)] into the buffer, fills in 
the Identification field, and computes its checksum. The packet is now ready for transmission. 
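
The prototype-header idea for TCP can be sketched as follows. This is a simplified illustration, not a real TCP implementation: the checksum here covers only the header and the payload, whereas real TCP also includes a pseudoheader, and the function names are ours. The byte offsets are those of the standard 20-byte TCP header (sequence number at bytes 4 to 7, checksum at bytes 16 and 17).

```python
import struct

def internet_checksum(data):
    """16-bit one's complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole 16-bit word
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_tcp_header(prototype, seq, payload):
    """Copy the 20-byte prototype, then overwrite only the changing fields."""
    header = bytearray(prototype)             # one block copy of five words
    struct.pack_into("!I", header, 4, seq)    # sequence number: bytes 4-7
    struct.pack_into("!H", header, 16, 0)     # zero checksum before summing
    csum = internet_checksum(bytes(header) + payload)
    struct.pack_into("!H", header, 16, csum)  # checksum: bytes 16-17
    return bytes(header)
```

Everything except the sequence number and checksum comes straight from the prototype, so the per-TPDU work is a short copy plus two stores. The caller would then advance the sequence number held in the connection state.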


Figure 6-45. (a) TCP header. (b) IP header. In both cases, the shaded 
fields are taken from the prototype without change. 


Now let us look at fast path processing on the receiving side of Fig. 6-44. Step 1 is locating the 
connection record for the incoming TPDU. For TCP, the connection record can be stored in a 
hash table for which some simple function of the two IP addresses and two ports is the key. 
Once the connection record has been located, both addresses and both ports must be 
compared to verify that the correct record has been found. 


An optimization that often speeds up connection record lookup even more is to maintain a 
pointer to the last one used and try that one first. Clark et al. (1989) tried this and observed a 
hit rate exceeding 90 percent. Other lookup heuristics are described in (McKenney and Dove, 
1992). 
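
A sketch of connection record lookup with a one-entry "last used" cache is given below. The class and its fields are hypothetical. A Python dictionary compares the full key after hashing, so the step of verifying both addresses and both ports is performed implicitly by the lookup.

```python
class ConnectionTable:
    """Hash table of connection records with a one-entry last-used cache."""

    def __init__(self):
        self._table = {}          # (src_ip, src_port, dst_ip, dst_port) -> record
        self._last_key = None
        self._last_record = None
        self.cache_hits = 0       # counted only to make the effect visible

    def add(self, key, record):
        self._table[key] = record

    def lookup(self, key):
        # Try the most recently used record first.
        if key == self._last_key:
            self.cache_hits += 1
            return self._last_record
        record = self._table.get(key)
        if record is not None:
            self._last_key, self._last_record = key, record
        return record
```

When TPDUs from one connection arrive back to back, as they do during a bulk transfer, nearly every lookup after the first is satisfied by the cached entry.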


The TPDU is then checked to see if it is a normal one: the state is ESTABLISHED, neither side 
is trying to close the connection, the TPDU is a full one, no special flags are set, and the 
sequence number is the one expected. These tests take just a handful of instructions. If all 
conditions are met, a special fast path TCP procedure is called. 


The fast path updates the connection record and copies the data to the user. While it is 
copying, it also computes the checksum, eliminating an extra pass over the data. If the 
checksum is correct, the connection record is updated and an acknowledgement is sent back. 
The general scheme of first making a quick check to see if the header is what is expected and 
then having a special procedure handle that case is called header prediction. Many TCP 
implementations use it. When this optimization and all the other ones discussed in this chapter 
are used together, it is possible to get TCP to run at 90 percent of the speed of a local 
memory-to-memory copy, assuming the network itself is fast enough. 
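
The quick test at the heart of header prediction might look like the sketch below. The record layouts and field names are assumptions made for illustration; only the flag bit values are those of the TCP header. Which flags count as "special" is also a choice of ours.

```python
from dataclasses import dataclass

FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

@dataclass
class Connection:          # hypothetical connection record
    state: str
    closing: bool
    expected_seq: int
    full_size: int         # size of a full TPDU on this connection

@dataclass
class Segment:             # hypothetical parsed incoming TPDU
    seq: int
    flags: int
    length: int

def is_fast_path(conn, seg):
    """True if the TPDU is the normal case and can take the fast path."""
    return (conn.state == "ESTABLISHED"
            and not conn.closing
            and seg.length == conn.full_size
            and (seg.flags & (FIN | SYN | RST | URG)) == 0
            and seg.seq == conn.expected_seq)
```

The test is a handful of comparisons on values that are already at hand, so TPDUs that fail it lose very little time before falling back to the general code.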


Two other areas where major performance gains are possible are buffer management and 
timer management. The issue in buffer management is avoiding unnecessary copying, as 
mentioned above. Timer management is important because nearly all timers set do not expire. 
They are set to guard against TPDU loss, but most TPDUs arrive correctly and their 
acknowledgements also arrive correctly. Hence, it is important to optimize timer management 
for the case of timers rarely expiring. 


A common scheme is to use a linked list of timer events sorted by expiration time. The head 
entry contains a counter telling how many ticks away from expiry it is. Each successive entry 
contains a counter telling how many ticks after the previous entry it is. Thus, if timers expire in 
3, 10, and 12 ticks, respectively, the three counters are 3, 7, and 2, respectively. 


At every clock tick, the counter in the head entry is decremented. When it hits zero, its event 
is processed and the next item on the list becomes the head. Its counter does not have to be 
changed. In this scheme, inserting and deleting timers are expensive operations, with 
execution times proportional to the length of the list. 
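
The linked-list scheme can be sketched as follows, with a Python list standing in for the linked list. The class and method names are ours. Note how insertion walks the list, subtracting counters as it goes, which is what makes it proportional to the list length.

```python
class DeltaTimerList:
    """Timer list in which each entry holds ticks relative to its predecessor."""

    def __init__(self):
        self._entries = []   # [delta_ticks, event], sorted by expiry

    def insert(self, ticks, event):
        """Schedule event to fire ticks (>= 1) from now; linear in list length."""
        if ticks < 1:
            raise ValueError("ticks must be at least 1")
        i = 0
        while i < len(self._entries) and self._entries[i][0] <= ticks:
            ticks -= self._entries[i][0]
            i += 1
        if i < len(self._entries):
            self._entries[i][0] -= ticks   # keep the successor's expiry fixed
        self._entries.insert(i, [ticks, event])

    def deltas(self):
        return [d for d, _ in self._entries]

    def tick(self):
        """Advance one clock tick; return the events that expire now."""
        fired = []
        if self._entries:
            self._entries[0][0] -= 1       # only the head is touched
            while self._entries and self._entries[0][0] <= 0:
                fired.append(self._entries.pop(0)[1])
        return fired
```

Inserting timers for 3, 10, and 12 ticks, in any order, leaves the counters 3, 7, and 2, matching the example in the text.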


A more efficient approach can be used if the maximum timer interval is bounded and known in 
advance. Here an array, called a timing wheel, can be used, as shown in Fig. 6-46. Each slot 
corresponds to one clock tick. The current time shown is T = 4. Timers are scheduled to expire 
at 3, 10, and 12 ticks from now. If a new timer suddenly is set to expire in seven ticks, an 
entry is just made in slot 11. Similarly, if the timer set for T + 10 has to be canceled, the list 
starting in slot 14 has to be searched and the required entry removed. Note that the array of 
Fig. 6-46 cannot accommodate timers beyond T + 15. 


Figure 6-46. A timing wheel. 


When the clock ticks, the current time pointer is advanced by one slot (circularly). If the entry 
now pointed to is nonzero, all of its timers are processed. Many variations on the basic idea are 
discussed in (Varghese and Lauck, 1987). 
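
A minimal sketch of a timing wheel follows, assuming a 16-slot array with the current time at slot 4, as in the example above. The class and method names are ours. Scheduling is a single array index computation; cancellation searches only the list in one slot.

```python
class TimingWheel:
    """Circular array of timer lists, one slot per clock tick."""

    def __init__(self, slots=16, now=0):
        self._slots = [[] for _ in range(slots)]
        self._now = now            # index of the current-time slot

    def schedule(self, ticks, event):
        """Add a timer ticks from now; return the slot it was placed in."""
        if not 1 <= ticks < len(self._slots):
            raise ValueError("interval out of range for this wheel")
        slot = (self._now + ticks) % len(self._slots)
        self._slots[slot].append(event)
        return slot

    def cancel(self, slot, event):
        """Search the list in one slot and remove the entry."""
        self._slots[slot].remove(event)

    def tick(self):
        """Advance the current-time pointer circularly; return expired events."""
        self._now = (self._now + 1) % len(self._slots)
        fired, self._slots[self._now] = self._slots[self._now], []
        return fired
```

With the current time at slot 4, timers for 3, 10, and 12 ticks land in slots 7, 14, and 0, and a new timer for seven ticks goes into slot 11. A request for 16 or more ticks is rejected, since the wheel cannot represent it.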


6.4.5 Protocols for Gigabit Networks 


At the start of the 1990s, gigabit networks began to appear. People's first reaction was to use 
the old protocols on them, but various problems quickly arose. In this section we will discuss 
some of these problems and the directions new protocols are taking to solve them as we move 
toward ever faster networks. 


The first problem is that many protocols use 32-bit sequence numbers. When the Internet 
began, the lines between routers were mostly 56-kbps leased lines, so a host blasting away at 
full speed took over 1 week to cycle through the sequence numbers. To the TCP designers, 2^32 
was a pretty decent approximation of infinity because there was little danger of old packets 
still being around a week after they were transmitted. With 10-Mbps Ethernet, the wrap time 
became 57 minutes, much shorter, but still manageable. With a 1-Gbps Ethernet pouring data 
out onto the Internet, the wrap time is about 34 seconds, well under the 120 sec maximum 
packet lifetime on the Internet. All of a sudden, 2^32 is not nearly as good an approximation to 
infinity since a sender can cycle through the sequence space while old packets still exist. RFC 
1323 provides an escape hatch, though. 
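
The wrap times quoted above follow directly from the fact that TCP numbers bytes, so a 32-bit sequence space covers 2^32 bytes. A small sketch (the function name is ours) reproduces them:

```python
def wrap_time_sec(rate_bps, seq_bits=32):
    """Seconds needed to send 2**seq_bits bytes at rate_bps."""
    return (2 ** seq_bits) * 8 / rate_bps

days_at_56_kbps = wrap_time_sec(56e3) / 86400     # a little over 7 days
minutes_at_10_mbps = wrap_time_sec(10e6) / 60     # about 57 minutes
seconds_at_1_gbps = wrap_time_sec(1e9)            # about 34 seconds
```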


The problem is that many protocol designers simply assumed, without stating it, that the time 
to use up the entire sequence space would greatly exceed the maximum packet lifetime. 
Consequently, there was no need to even worry about the problem of old duplicates still 
existing when the sequence numbers wrapped around. At gigabit speeds, that unstated 
assumption fails. 


A second problem is that communication speeds have improved much faster than computing 
speeds. (Note to computer engineers: Go out and beat those communication engineers! We 
are counting on you.) In the 1970s, the ARPANET ran at 56 kbps and had computers that ran 
at about 1 MIPS. Packets were 1008 bits, so the ARPANET was capable of delivering about 56 
packets/sec. With almost 18 msec available per packet, a host could afford to spend 18,000 
instructions processing a packet. Of course, doing so would soak up the entire CPU, but it 
could devote 9000 instructions per packet and still have half the CPU left to do real work. 


Compare these numbers to 1000-MIPS computers exchanging 1500-byte packets over a 
gigabit line. Packets can flow in at a rate of over 80,000 per second, so packet processing 
must be completed in 6.25 usec if we want to reserve half the CPU for applications. In 6.25 
usec, a 1000-MIPS computer can execute 6250 instructions, only 1/3 of what the ARPANET 
hosts had available. Furthermore, modern RISC instructions do less per instruction than the 
old CISC instructions did, so the problem is even worse than it appears. The conclusion is this: 
there is less time available for protocol processing than there used to be, so protocols must 
become simpler. 
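
The instruction budgets in this comparison can be recomputed with the sketch below (the function name is ours). Using the exact packet rate of about 83,333 packets/sec rather than the rounded 80,000 gives a budget of about 6000 instructions instead of 6250, which does not change the conclusion.

```python
def instructions_per_packet(rate_bps, packet_bits, mips, cpu_fraction=1.0):
    """Instructions the CPU can spend on each packet at full line rate."""
    packets_per_sec = rate_bps / packet_bits
    return mips * 1e6 * cpu_fraction / packets_per_sec

arpanet = instructions_per_packet(56e3, 1008, 1)              # 18,000
gigabit = instructions_per_packet(1e9, 1500 * 8, 1000, 0.5)   # about 6000
```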


A third problem is that the go back n protocol performs poorly on lines with a large 
bandwidth-delay product. Consider, for example, a 4000-km line operating at 1 Gbps. The round-trip 
transmission time is 40 msec, in which time a sender can transmit 5 megabytes. If an error is 
detected, it will be 40 msec before the sender is told about it. If go back n is used, the sender 
will have to retransmit not just the bad packet, but also the 5 megabytes worth of packets that 
came afterward. Clearly, this is a massive waste of resources. 
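
The 5-megabyte figure is simply the bandwidth-delay product. The sketch below recomputes it and, assuming 1500-byte packets as in the earlier comparison, converts it to a packet count. The function names are ours.

```python
import math

def go_back_n_resend_bytes(rate_bps, rtt_sec):
    """Bytes sent during one round trip, all of which go back n resends."""
    return rate_bps * rtt_sec / 8

def go_back_n_resend_packets(rate_bps, rtt_sec, packet_bytes=1500):
    """The same quantity expressed in packets (packet size is an assumption)."""
    return math.ceil(go_back_n_resend_bytes(rate_bps, rtt_sec) / packet_bytes)
```

At 1 Gbps with a 40-msec round trip, a single damaged packet forces the retransmission of more than 3000 packets that arrived intact.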


A fourth problem is that gigabit lines are fundamentally different from megabit lines in that 
long gigabit lines are delay limited rather than bandwidth limited. In Fig. 6-47 we show the 
time it takes to transfer a 1-megabit file 4000 km at various transmission speeds. At speeds 
up to 1 Mbps, the transmission time is dominated by the rate at which the bits can be sent. By 
1 Gbps, the 40-msec roundtrip delay dominates the 1 msec it takes to put the bits on the 
fiber. Further increases in bandwidth have hardly any effect at all. 
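
A rough model of this behavior is the transmission time plus one round trip. The sketch below assumes the 40-msec round-trip delay of the 4000-km line and takes a 1-megabit file to be 10^6 bits; the function name is ours.

```python
def transfer_time_sec(file_bits, rate_bps, rtt_sec=0.040):
    """Time to send a file and get the acknowledgement back."""
    return file_bits / rate_bps + rtt_sec

at_1_mbps = transfer_time_sec(1e6, 1e6)     # about 1.04 sec
at_1_gbps = transfer_time_sec(1e6, 1e9)     # about 0.041 sec
at_1_tbps = transfer_time_sec(1e6, 1e12)    # still about 0.040 sec
```

Going from 1 Mbps to 1 Gbps cuts the time by a factor of about 25, but a further thousandfold increase in bandwidth saves less than a millisecond.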


Figure 6-47. Time to transfer and acknowledge a 1-megabit file over a 
4000-km line. 


Figure 6-47 has unfortunate implications for network protocols. It says that stop-and-wait 
protocols, such as RPC, have an inherent upper bound on their performance. This limit is 
dictated by the speed of light. No amount of technological progress in optics will ever improve 
matters (new laws of physics would help, though). 


A fifth problem that is worth mentioning is not a technological or protocol one like the others, 
but a result of new applications. Simply stated, it is that for many gigabit applications, such as 
multimedia, the variance in the packet arrival times is as important as the mean delay itself. A 
slow-but-uniform delivery rate is often preferable to a fast-but-jumpy one. 


Let us now turn from the problems to ways of dealing with them. We will first make some 
general remarks, then look at protocol mechanisms, packet layout, and protocol software. 


The basic principle that all gigabit network designers should learn by heart is: 
Design for speed, not for bandwidth optimization. 


Old protocols were often designed to minimize the number of bits on the wire, frequently by 
using small fields and packing them together into bytes and words. Nowadays, there is plenty 
of bandwidth. Protocol processing is the problem, so protocols should be designed to minimize 
it. The IPv6 designers clearly understood this principle. 


A tempting way to go fast is to build fast network interfaces in hardware. The difficulty with 
this strategy is that unless the protocol is exceedingly simple, hardware just means a plug-in 
board with a second CPU and its own program. To make sure the network coprocessor is 
cheaper than the main CPU, it is often a slower chip. The consequence of this design is that 
much of the time the main (fast) CPU is idle waiting for the second (slow) CPU to do the critical 
work. It is a myth to think that the main CPU has other work to do while waiting. Furthermore, 
when two general-purpose CPUs communicate, race conditions can occur, so elaborate 
protocols are needed between the two processors to synchronize them correctly. Usually, the 
best approach is to make the protocols simple and have the main CPU do the work. 


Let us now look at the issue of feedback in high-speed protocols. Due to the (relatively) long 
delay loop, feedback should be avoided: it takes too long for the receiver to signal the sender. 
One example of feedback is governing the transmission rate by using a sliding window 
protocol. To avoid the (long) delays inherent in the receiver sending window updates to the 
sender, it is better to use a rate-based protocol. In such a protocol, the sender can send all it 
wants to, provided it does not send faster than some rate the sender and receiver have agreed 
upon in advance. 
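
One simple way to realize a rate-based sender is to space the packets so that the agreed rate is never exceeded. The sketch below computes such a schedule; it is an illustration of the idea only, and the function name and parameters are ours.

```python
def paced_send_times(packet_sizes_bytes, rate_bps, start=0.0):
    """Earliest send time for each packet so the agreed rate is not exceeded."""
    times, t = [], start
    for size in packet_sizes_bytes:
        times.append(t)
        t += size * 8 / rate_bps   # gap that keeps the sender at rate_bps
    return times
```

For example, 1250-byte packets at an agreed rate of 1 Mbps are sent 10 msec apart. The sender needs no feedback from the receiver to follow this schedule.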


A second example of feedback is Jacobson's slow start algorithm. This algorithm makes 
multiple probes to see how much the network can handle. With high-speed networks, making 
half a dozen or so small probes to see how the network responds wastes a huge amount of 
bandwidth. A more efficient scheme is to have the sender, receiver, and network all reserve 
the necessary resources at connection setup time. Reserving resources in advance also has the 
advantage of making it easier to reduce jitter. In short, going to high speeds inexorably 
pushes the design toward connection-oriented operation, or something fairly close to it. Of 
course, if bandwidth becomes so plentiful in the future that nobody cares about wasting lots of 
it, the design rules will become very different. 


Packet layout is an important consideration in gigabit networks. The header should contain as 
few fields as possible, to reduce processing time, and these fields should be big enough to do 
the job and be word aligned for ease of processing. In this context, "big enough" means that 
problems such as sequence numbers wrapping around while old packets still exist, receivers 
being unable to advertise enough window space because the window field is too small, and so 
on do not occur. 


The header and data should be separately checksummed, for two reasons. First, to make it 
possible to checksum the header but not the data. Second, to verify that the header is correct 
before copying the data into user space. It is desirable to do the data checksum at the time 
the data are copied to user space, but if the header is incorrect, the copy may go to the wrong 
process. To avoid an incorrect copy but to allow the data checksum to be done during copying, 
it is essential that the two checksums be separate. 


The maximum data size should be large, to permit efficient operation even in the face of long 
delays. Also, the larger the data block, the smaller the fraction of the total bandwidth devoted 
to headers. 1500 bytes is too small. 


Another valuable feature is the ability to send a normal amount of data along with the 
connection request. In this way, one round-trip time can be saved. 


Finally, a few words about the protocol software are appropriate. A key thought is 
concentrating on the successful case. Many older protocols tend to emphasize what to do when 
something goes wrong (e.g., a packet getting lost). To make the protocols run fast, the 
designer should aim for minimizing processing time when everything goes right. Minimizing 
processing time when an error occurs is secondary. 


A second software issue is minimizing copying time. As we saw earlier, copying data is often 
the main source of overhead. Ideally, the hardware should dump each incoming packet into 
memory as a contiguous block of data. The software should then copy this packet to the user 
buffer with a single block copy. Depending on how the cache works, it may even be desirable 
to avoid a copy loop. In other words, to copy 1024 words, the fastest way may be to have 
1024 back-to-back move instructions (or 1024 load-store pairs). The copy routine is so critical 
it should be carefully handcrafted in assembly code, unless there is a way to trick the compiler 
into producing precisely the optimal code. 


6.5 Summary 


The transport layer is the key to understanding layered protocols. It provides various services, 
the most important of which is an end-to-end, reliable, connection-oriented byte stream from 
sender to receiver. It is accessed through service primitives that permit the establishment, 
use, and release of connections. A common transport layer interface is the one provided by 
Berkeley sockets. 


Transport protocols must be able to do connection management over unreliable networks. 
Connection establishment is complicated by the existence of delayed duplicate packets that 
can reappear at inopportune moments. To deal with them, three-way handshakes are needed 
to establish connections. Releasing a connection is easier than establishing one but is still far 
from trivial due to the two-army problem. 


Even when the network layer is completely reliable, the transport layer has plenty of work to 
do. It must handle all the service primitives, manage connections and timers, and allocate and 
utilize credits. 


The Internet has two main transport protocols: UDP and TCP. UDP is a connectionless protocol 
that is mainly a wrapper for IP packets with the additional feature of multiplexing and 
demultiplexing multiple processes using a single IP address. UDP can be used for client-server 
interactions, for example, using RPC. It can also be used for building real-time protocols such 
as RTP. 


The main Internet transport protocol is TCP. It provides a reliable bidirectional byte stream. It 
uses a 20-byte header on all segments. Segments can be fragmented by routers within the 
Internet, so hosts must be prepared to do reassembly. A great deal of work has gone into 
optimizing TCP performance, using algorithms from Nagle, Clark, Jacobson, Karn, and others. 
Wireless links add a variety of complications to TCP. Transactional TCP is an extension to TCP 
that handles client-server interactions with a reduced number of packets. 


Network performance is typically dominated by protocol and TPDU processing overhead, and 
this situation gets worse at higher speeds. Protocols should be designed to minimize the 
number of TPDUs, context switches, and times each TPDU is copied. For gigabit networks, 
simple protocols are called for. 


Chapter 7. The Application Layer 


Having finished all the preliminaries, we now come to the layer where all the applications are 
found. The layers below the application layer are there to provide reliable transport, but they 
do not do real work for users. In this chapter we will study some real network applications. 


However, even in the application layer there is a need for support protocols, to allow the 
applications to function. Accordingly, we will look at one of these before starting with the 
applications themselves. The item in question is DNS, which handles naming within the 
Internet. After that, we will examine three real applications: electronic mail, the World Wide 
Web, and finally, multimedia. 


7.1 DNS—The Domain Name System 


Although programs theoretically could refer to hosts, mailboxes, and other resources by their 
network (e.g., IP) addresses, these addresses are hard for people to remember. Also, sending 
e-mail to tana@128.111.24.41 means that if Tana's ISP or organization moves the mail server 
to a different machine with a different IP address, her e-mail address has to change. 
Consequently, ASCII names were introduced to decouple machine names from machine 
addresses. In this way, Tana's address might be something like tana@art.ucsb.edu. 
Nevertheless, the network itself understands only numerical addresses, so some mechanism is 
required to convert the ASCII strings to network addresses. In the following sections we will 
study how this mapping is accomplished in the Internet. 


Way back in the ARPANET, there was simply a file, hosts.txt, that listed all the hosts and their 
IP addresses. Every night, all the hosts would fetch it from the site at which it was maintained. 
For a network of a few hundred large timesharing machines, this approach worked reasonably 
well. 


However, when thousands of minicomputers and PCs were connected to the net, everyone 
realized that this approach could not continue to work forever. For one thing, the size of the 
file would become too large. However, even more important, host name conflicts would occur 
constantly unless names were centrally managed, something unthinkable in a huge 
international network due to the load and latency. To solve these problems, DNS (the Domain 
Name System) was invented. 


The essence of DNS is the invention of a hierarchical, domain-based naming scheme and a 
distributed database system for implementing this naming scheme. It is primarily used for 
mapping host names and e-mail destinations to IP addresses but can also be used for other 
purposes. DNS is defined in RFCs 1034 and 1035. 


Very briefly, the way DNS is used is as follows. To map a name onto an IP address, an 
application program calls a library procedure called the resolver, passing it the name as a 
parameter. We saw an example of a resolver, gethostbyname, in Fig. 6-6. The resolver sends 
a UDP packet to a local DNS server, which then looks up the name and returns the IP address 
to the resolver, which then returns it to the caller. Armed with the IP address, the program can 
then establish a TCP connection with the destination or send it UDP packets. 
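
The resolver call described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard socket module as the resolver library; the helper name resolve is our own and does not come from the text.

```python
import socket

def resolve(name):
    """Ask the system resolver for the IPv4 addresses registered for name."""
    # getaddrinfo hands the name to the local resolver, which in turn
    # queries the configured DNS server when the answer is not known locally.
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})
```

A program would call something like resolve("www.cs.vu.nl") and then open a TCP connection to one of the returned addresses. The answer depends on the network the program runs on, so no output is shown here.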


7.1.1 The DNS Name Space 


Managing a large and constantly changing set of names is a nontrivial problem. In the postal 
system, name management is done by requiring letters to specify (implicitly or explicitly) the 
country, state or province, city, and street address of the addressee. By using this kind of 
hierarchical addressing, there is no confusion between the Marvin Anderson on Main St. in 
White Plains, N.Y. and the Marvin Anderson on Main St. in Austin, Texas. DNS works the same 
way. 


Conceptually, the Internet is divided into over 200 top-level domains, where each domain 
covers many hosts. Each domain is partitioned into subdomains, and these are further 
partitioned, and so on. All these domains can be represented by a tree, as shown in Fig. 7-1. 
The leaves of the tree represent domains that have no subdomains (but do contain machines, 
of course). A leaf domain may contain a single host, or it may represent a company and 
contain thousands of hosts. 


Figure 7-1. A portion of the Internet domain name space. 


The top-level domains come in two flavors: generic and countries. The original generic 
domains were com (commercial), edu (educational institutions), gov (the U.S. Federal 
Government), int (certain international organizations), mil (the U.S. armed forces), net 
(network providers), and org (nonprofit organizations). The country domains include one entry 
for every country, as defined in ISO 3166. 


In November 2000, ICANN approved four new, general-purpose, top-level domains, namely, 
biz (businesses), info (information), name (people's names), and pro (professions, such as 
doctors and lawyers). In addition, three more specialized top-level domains were introduced at 
the request of certain industries. These are aero (aerospace industry), coop (co-operatives), 
and museum (museums). Other top-level domains will be added in the future. 


As an aside, as the Internet becomes more commercial, it also becomes more contentious. 
Take pro, for example. It was intended for certified professionals. But who is a professional? 
And certified by whom? Doctors and lawyers clearly are professionals. But what about 
freelance photographers, piano teachers, magicians, plumbers, barbers, exterminators, tattoo 
artists, mercenaries, and prostitutes? Are these occupations professional and thus eligible for 
pro domains? And if so, who certifies the individual practitioners? 


In general, getting a second-level domain, such as name-of-company.com, is easy. It merely 
requires going to a registrar for the corresponding top-level domain (com in this case) to check 
if the desired name is available and not somebody else's trademark. If there are no problems, 
the requester pays a small annual fee and gets the name. By now, virtually every common 
(English) word has been taken in the com domain. Try household articles, animals, plants, 
body parts, etc. Nearly all are taken. 


Each domain is named by the path upward from it to the (unnamed) root. The components are 
separated by periods (pronounced "dot"). Thus, the engineering department at Sun 
Microsystems might be eng.sun.com., rather than a UNIX-style name such as /com/sun/eng. 
Notice that this hierarchical naming means that eng.sun.com. does not conflict with a potential 
use of eng in eng.yale.edu., which might be used by the Yale English department. 


Domain names can be either absolute or relative. An absolute domain name always ends with 
a period (e.g., eng.sun.com.), whereas a relative one does not. Relative names have to be 
interpreted in some context to uniquely determine their true meaning. In both cases, a named 
domain refers to a specific node in the tree and all the nodes under it. 


Domain names are case insensitive, so edu, Edu, and EDU mean the same thing. Component 
names can be up to 63 characters long, and full path names must not exceed 255 characters. 
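
The naming rules just given (a trailing period for absolute names, case insensitivity, and the 63- and 255-character limits) can be checked mechanically. The following is a minimal sketch; the function names are our own, and it checks only the rules stated above, not every restriction a real DNS implementation applies.

```python
MAX_LABEL = 63    # longest allowed component name
MAX_NAME = 255    # longest allowed full path name

def is_absolute(name):
    """An absolute domain name ends with a period."""
    return name.endswith(".")

def normalize(name):
    """Drop the trailing period, if any, and fold to lower case."""
    if name.endswith("."):
        name = name[:-1]
    return name.lower()

def same_domain(a, b):
    """Domain names are case insensitive, so compare normalized forms."""
    return normalize(a) == normalize(b)

def is_valid(name):
    """Check the component and full-name length limits."""
    plain = normalize(name)
    if len(plain) > MAX_NAME:
        return False
    return all(1 <= len(label) <= MAX_LABEL for label in plain.split("."))
```

For example, same_domain("edu", "EDU") is true, and is_valid rejects a name with a 64-character component.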


In principle, domains can be inserted into the tree in two different ways. For example, 
cs.yale.edu could equally well be listed under the us country domain as cs.yale.ct.us. In 
practice, however, most organizations in the United States are under a generic domain, and 
most outside the United States are under the domain of their country. There is no rule against 
registering under two top-level domains, but few organizations except multinationals do it 
(e.g., sony.com and sony.nl). 


Each domain controls how it allocates the domains under it. For example, Japan has domains 
ac.jp and co.jp that mirror edu and com. The Netherlands does not make this distinction and 
puts all organizations directly under nl. Thus, all three of the following are university computer 
science departments: 


1. cs.yale.edu (Yale University, in the United States) 
2. cs.vu.nl (Vrije Universiteit, in The Netherlands) 
3. cs.keio.ac.jp (Keio University, in Japan) 


To create a new domain, permission is required of the domain in which it will be included. For 
example, if a VLSI group is started at Yale and wants to be known as vlsi.cs.yale.edu, it has to 
get permission from whoever manages cs.yale.edu. Similarly, if a new university is chartered, 
say, the University of Northern South Dakota, it must ask the manager of the edu domain to 
assign it unsd.edu. In this way, name conflicts are avoided and each domain can keep track of 
all its subdomains. Once a new domain has been created and registered, it can create 
subdomains, such as cs.unsd.edu, without getting permission from anybody higher up the 
tree. 


Naming follows organizational boundaries, not physical networks. For example, if the computer 
science and electrical engineering departments are located in the same building and share the 
same LAN, they can nevertheless have distinct domains. Similarly, even if computer science is 
split over Babbage Hall and Turing Hall, the hosts in both buildings will normally belong to the 
same domain. 


7.1.2 Resource Records 


Every domain, whether it is a single host or a top-level domain, can have a set of resource 
records associated with it. For a single host, the most common resource record is just its IP 
address, but many other kinds of resource records also exist. When a resolver gives a domain 
name to DNS, what it gets back are the resource records associated with that name. Thus, the 
primary function of DNS is to map domain names onto resource records. 


A resource record is a five-tuple. Although they are encoded in binary for efficiency, in most 
expositions, resource records are presented as ASCII text, one line per resource record. The 
format we will use is as follows: 


Domain name Time to live Class Type Value 


The Domain name tells the domain to which this record applies. Normally, many records exist 
for each domain and each copy of the database holds information about multiple domains. This 
field is thus the primary search key used to satisfy queries. The order of the records in the 
database is not significant. 


The Time to live field gives an indication of how stable the record is. Information that is highly 
stable is assigned a large value, such as 86400 (the number of seconds in 1 day). Information 
that is highly volatile is assigned a small value, such as 60 (1 minute). We will come back to 
this point later when we have discussed caching. 


The third field of every resource record is the Class. For Internet information, it is always IN. 
For non-Internet information, other codes can be used, but in practice, these are rarely seen. 


The Type field tells what kind of record this is. The most important types are listed in Fig. 7-2. 


Figure 7-2. The principal DNS resource record types for IPv4. 


Type    Meaning               Value 
SOA     Start of Authority    Parameters for this zone 
A       IP address of a host  32-Bit integer 
MX      Mail exchange         Priority, domain willing to accept e-mail 
NS      Name Server           Name of a server for this domain 
CNAME   Canonical name        Domain name 
PTR     Pointer               Alias for an IP address 
HINFO   Host description      CPU and OS in ASCII 
TXT     Text                  Uninterpreted ASCII text 


An SOA record provides the name of the primary source of information about the name 
server's zone (described below), the e-mail address of its administrator, a unique serial 
number, and various flags and timeouts. 


The most important record type is the A (Address) record. It holds a 32-bit IP address for 
some host. Every Internet host must have at least one IP address so that other machines can 
communicate with it. Some hosts have two or more network connections, in which case they 
will have one type A resource record per network connection (and thus per IP address). DNS 
can be configured to cycle through these, returning the first record on the first request, the 
second record on the second request, and so on. 
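
The cycling behavior can be pictured with a short sketch. It is a simplification under our own assumptions: the class name RoundRobin is hypothetical, and a real name server would typically rotate the whole list of records rather than hand out one at a time.

```python
from itertools import cycle

class RoundRobin:
    """Hand out a host's A records in turn, one per request."""

    def __init__(self, addresses):
        # cycle() repeats the list endlessly in its original order.
        self._next_address = cycle(addresses)

    def answer(self):
        """Return the address to use for the next request."""
        return next(self._next_address)
```

With two addresses, the first request gets the first address, the second request gets the second, and the third request wraps around to the first again.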


The next most important record type is the MX record. It specifies the name of the host 
prepared to accept e-mail for the specified domain. It is used because not every machine is 
prepared to accept e-mail. If someone wants to send e-mail to, for example, 
bill@microsoft.com, the sending host needs to find a mail server at microsoft.com that is 
willing to accept e-mail. The MX record can provide this information. 


The NS records specify name servers. For example, every DNS database normally has an NS 
record for each of the top-level domains, so, for example, e-mail can be sent to distant parts 
of the naming tree. We will come back to this point later. 


CNAME records allow aliases to be created. For example, a person familiar with Internet 
naming in general and wanting to send a message to someone whose login name is paul in the 
computer science department at M.I.T. might guess that paul@cs.mit.edu will work. Actually, 
this address will not work, because the domain for M.I.T.'s computer science department is 
lcs.mit.edu. However, as a service to people who do not know this, M.I.T. could create a 
CNAME entry to point people and programs in the right direction. An entry like this one might 
do the job: 


cs.mit.edu 86400 IN CNAME  lcs.mit.edu 


Like CNAME, PTR points to another name. However, unlike CNAME, which is really just a macro 
definition, PTR is a regular DNS datatype whose interpretation depends on the context in which 
it is found. In practice, it is nearly always used to associate a name with an IP address to allow 
lookups of the IP address and return the name of the corresponding machine. These are called 
reverse lookups. 
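
As an illustration of a reverse lookup, the sketch below builds the name under which a PTR record for an IPv4 address is conventionally stored. The in-addr.arpa convention is not described in this section and is brought in here as background; the one-entry table and the function names are hypothetical.

```python
import ipaddress

def reverse_lookup_name(address):
    """Build the domain name that holds the PTR record for an IPv4 address."""
    # IPv4Address() validates the dotted-decimal string before we split it.
    octets = str(ipaddress.IPv4Address(address)).split(".")
    # The octets are written in reverse order, most specific part first.
    return ".".join(reversed(octets)) + ".in-addr.arpa."

# A hypothetical one-entry PTR table standing in for the DNS database.
PTR_RECORDS = {"112.16.37.130.in-addr.arpa.": "flits.cs.vu.nl."}

def reverse_lookup(address):
    """Return the host name for address, or None if no PTR record exists."""
    return PTR_RECORDS.get(reverse_lookup_name(address))
```

Looking up 130.37.16.112 in this toy table yields flits.cs.vu.nl., which is the reverse of what an A record provides.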


HINFO records allow people to find out what kind of machine and operating system a domain 
corresponds to. Finally, TXT records allow domains to identify themselves in arbitrary ways. 
Both of these record types are for user convenience. Neither is required, so programs cannot 
count on getting them (and probably cannot deal with them if they do get them). 


Finally, we have the Value field. This field can be a number, a domain name, or an ASCII 
string. The semantics depend on the record type. A short description of the Value fields for 
each of the principal record types is given in Fig. 7-2. 
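
To make the five-tuple concrete, here is a minimal sketch that splits the one-line ASCII form of a resource record into its five fields. The names ResourceRecord and parse_record are our own, and the sketch assumes that all five fields are present on the line (the abbreviated forms that omit the name or the Time to live are not handled).

```python
from typing import NamedTuple

class ResourceRecord(NamedTuple):
    domain_name: str
    time_to_live: int   # seconds
    rr_class: str       # IN for Internet information
    rr_type: str        # A, MX, CNAME, ...
    value: str          # meaning depends on rr_type

def parse_record(line):
    """Split a full one-line resource record into its five fields."""
    # The first four fields never contain spaces; whatever remains is the value.
    name, ttl, rr_class, rr_type, value = line.split(None, 4)
    return ResourceRecord(name.lower(), int(ttl), rr_class, rr_type, value.strip())
```

Applied to the CNAME entry shown earlier, parse_record("cs.mit.edu 86400 IN CNAME  lcs.mit.edu") yields a record whose type is CNAME and whose value is lcs.mit.edu.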

For an example of the kind of information one might find in the DNS database of a domain, see 
Fig. 7-3. This figure depicts part of a (semihypothetical) database for the cs.vu.nl domain 
shown in Fig. 7-1. The database contains seven types of resource records. 


Figure 7-3. A portion of a possible DNS database for cs.vu.nl 


; Authoritative data for cs.vu.nl 
cs.vu.nl.        86400  IN  SOA    star boss (9527,7200,7200,241920,86400) 
cs.vu.nl.        86400  IN  TXT    "Divisie Wiskunde en Informatica." 
cs.vu.nl.        86400  IN  TXT    "Vrije Universiteit Amsterdam." 
cs.vu.nl.        86400  IN  MX     1 zephyr.cs.vu.nl. 
cs.vu.nl.        86400  IN  MX     2 top.cs.vu.nl. 

flits.cs.vu.nl.  86400  IN  HINFO  Sun Unix 
flits.cs.vu.nl.  86400  IN  A      130.37.16.112 
flits.cs.vu.nl.  86400  IN  A      192.31.231.165 
flits.cs.vu.nl.  86400  IN  MX     1 flits.cs.vu.nl. 
flits.cs.vu.nl.  86400  IN  MX     2 zephyr.cs.vu.nl. 
flits.cs.vu.nl.  86400  IN  MX     3 top.cs.vu.nl. 
www.cs.vu.nl.    86400  IN  CNAME  star.cs.vu.nl 
ftp.cs.vu.nl.    86400  IN  CNAME  zephyr.cs.vu.nl 

rowboat                 IN  A      130.37.56.201 
                        IN  MX     1 rowboat 
                        IN  MX     2 zephyr 
                        IN  HINFO  Sun Unix 

little-sister           IN  A      130.37.62.23 
                        IN  HINFO  Mac MacOS 

laserjet                IN  A      192.31.231.216 
                        IN  HINFO  "HP Laserjet IIISi" Proprietary 


The first noncomment line of Fig. 7-3 gives some basic information about the domain, which 
will not concern us further. The next two lines give textual information about where the 
domain is located. Then come two entries giving the first and second places to try to deliver 
e-mail sent to person@cs.vu.nl. The zephyr (a specific machine) should be tried first. If that 
fails, the top should be tried as the next choice. 
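
The order in which a sending host works through MX records can be sketched as follows. The function names are our own, and try_host stands in for whatever routine actually attempts delivery; it is an assumption of the sketch, not part of DNS.

```python
def delivery_order(mx_records):
    """Order mail exchangers by priority; the lowest number is tried first.

    mx_records is a list of (priority, host) pairs taken from MX records.
    """
    return [host for priority, host in sorted(mx_records)]

def deliver(mx_records, try_host):
    """Try each mail exchanger in turn; return the first host that accepts.

    try_host(host) should return True when delivery to that host succeeds.
    """
    for host in delivery_order(mx_records):
        if try_host(host):
            return host
    return None  # every exchanger failed
```

With the two MX records for cs.vu.nl, the zephyr is tried first, and the top is used only if the zephyr does not accept the message.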


After the blank line, added for readability, come lines telling that the flits is a Sun workstation 
running UNIX and giving both of its IP addresses. Then three choices are given for handling 
e-mail sent to flits.cs.vu.nl. First choice is naturally the flits itself, but if it is down, the zephyr 
and top are the second and third choices. Next comes an alias, www.cs.vu.nl, so that this 
address can be used without designating a specific machine. Creating this alias allows cs.vu.nl 
to change its World Wide Web server without invalidating the address people use to get to it. A 
similar argument holds for ftp.cs.vu.nl. 


The next four lines contain a typical entry for a workstation, in this case, rowboat.cs.vu.nl. The 
information provided contains the IP address, the primary and secondary mail drops, and 
information about the machine. Then comes an entry for a non-UNIX system that is not 
capable of receiving mail itself, followed by an entry for a laser printer that is connected to the 
Internet. 


What are not shown (and are not in this file) are the IP addresses used to look up the top-level 
domains. These are needed to look up distant hosts, but since they are not part of the cs.vu.nl 
domain, they are not in this file. They are supplied by the root servers, whose IP addresses are 
present in a system configuration file and loaded into the DNS cache when the DNS server is 
booted. There are about a dozen root servers spread around the world, and each one knows 
the IP addresses of all the top-level domain servers. Thus, if a machine knows the IP address 
of at least one root server, it can look up any DNS name. 


7.1.3 Name Servers 


In theory at least, a single name server could contain the entire DNS database and respond to 
all queries about it. In practice, this server would be so overloaded as to be useless. 
Furthermore, if it ever went down, the entire Internet would be crippled. 


To avoid the problems associated with having only a single source of information, the DNS 
name space is divided into nonoverlapping zones. One possible way to divide the name space 
of Fig. 7-1 is shown in Fig. 7-4. Each zone contains some part of the tree and also contains 
name servers holding the information about that zone. Normally, a zone will have one primary 
name server, which gets its information from a file on its disk, and one or more secondary 
name servers, which get their information from the primary name server. To improve 
reliability, some servers for a zone can be located outside the zone. 


Figure 7-4. Part of the DNS name space showing the division into 
zones. 


Where the zone boundaries are placed within a zone is up to that zone's administrator. This 
decision is made in large part based on how many name servers are desired, and where. For 
example, in Fig. 7-4, Yale has a server for yale.edu that handles eng.yale.edu but not 
cs.yale.edu, which is a separate zone with its own name servers. Such a decision might be 
made when a department such as English does not wish to run its own name server, but a 
department such as computer science does. Consequently, cs.yale.edu is a separate zone but 
eng.yale.edu is not. 
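
One way to picture the zone division is as a search for the most specific zone whose name is a suffix of the domain in question. The sketch below assumes a hypothetical set of zones matching the Yale example; the function name is our own.

```python
def authoritative_zone(name, zones):
    """Return the most specific zone in zones that contains name, or None."""
    labels = name.lower().rstrip(".").split(".")
    # Try the full name first, then drop one leading component at a time,
    # so the longest (most specific) matching zone wins.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in zones:
            return candidate
    return None

# Hypothetical zones for the Yale example: computer science runs its own zone.
ZONES = {"edu", "yale.edu", "cs.yale.edu"}
```

Here authoritative_zone("ai.cs.yale.edu", ZONES) gives cs.yale.edu, whereas eng.yale.edu falls in the yale.edu zone because the English department has no zone of its own.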


When a resolver has a query about a domain name, it passes the query to one of the local 
name servers. If the domain being sought falls under the jurisdiction of the name server, such 
as ai.cs.yale.edu falling under cs.yale.edu, it returns the authoritative resource records. An 
authoritative record is one that comes from the authority that manages the record and is 
thus always correct. Authoritative records are in contrast to cached records, which may be out 
of date. 


If, however, the domain is remote and no information about the requested domain is available 
locally, the name server sends a query message to the top-level name server for the domain 
requested. To make this process clearer, consider the example of Fig. 7-5. Here, a resolver on 
flits.cs.vu.nl wants to know the IP address of the host linda.cs.yale.edu. In step 1, it sends a 
query to the local name server, cs.vu.nl. This query contains the domain name sought, the 
type (A) and the class (IN). 


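Figure 7-5. How a resolver looks up a remote name in eight steps. 


[Figure: the query travels from the originator, flits.cs.vu.nl, to the VU CS name server 
(cs.vu.nl) in step 1, to the Edu name server (edu-server.net) in step 2, to the Yale name 
server (yale.edu) in step 3, and to the Yale CS name server (cs.yale.edu) in step 4. The 
answer returns along the same path in steps 5 through 8.] 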


Let us suppose the local name server has never had a query for this domain before and knows 
nothing about it. It may ask a few other nearby name servers, but if none of them know, it 
sends a UDP packet to the server for edu given in its database (see Fig. 7-5), edu-server.net. 
It is unlikely that this server knows the address of linda.cs.yale.edu, and probably does not 
know cs.yale.edu either, but it must know all of its own children, so it forwards the request to 
the name server for yale.edu (step 3). In turn, this one forwards the request to cs.yale.edu 
(step 4), which must have the authoritative resource records. Since each request is from a 
client to a server, the resource record requested works its way back in steps 5 through 8. 


Once these records get back to the cs.vu.nl name server, they will be entered into a cache 
there, in case they are needed later. However, this information is not authoritative, since 
changes made at cs.yale.edu will not be propagated to all the caches in the world that may 
know about it. For this reason, cache entries should not live too long. This is the reason that 
the Time to live field is included in each resource record. It tells remote name servers how 
long to cache records. If a certain machine has had the same IP address for years, it may be 
safe to cache that information for 1 day. For more volatile information, it might be safer to 
purge the records after a few seconds or a minute. 
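
The role of the Time to live field in caching can be sketched as follows. This is a minimal illustration, not how any particular name server is implemented; the class name DnsCache is hypothetical, and the clock is passed in so that the expiry logic can be exercised without waiting.

```python
import time

class DnsCache:
    """Cache of nonauthoritative answers that expire after their time to live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # function returning the current time in seconds
        self._entries = {}       # name -> (value, expiry time)

    def put(self, name, value, ttl):
        """Remember value for name for ttl seconds."""
        self._entries[name.lower()] = (value, self._clock() + ttl)

    def get(self, name):
        """Return the cached value, or None if it is absent or has expired."""
        key = name.lower()
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self._clock() >= expires:
            del self._entries[key]   # purge the stale record
            return None
        return value
```

A record stored with a Time to live of 60 is returned for the next minute and then disappears, forcing a fresh query to the authoritative server.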


It is worth mentioning that the query method described here is known as a recursive query, 
since each server that does not have the requested information goes and finds it somewhere, 
then reports back. An alternative form is also possible. In this form, when a query cannot be 
satisfied locally, the query fails, but the name of the next server along the line to try is 
returned. Some servers do not implement recursive queries and always return the name of the 
next server to try. 
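
The difference between the two query forms can be seen in a small simulation of the servers in the linda.cs.yale.edu example. Everything in the table below is hypothetical, including the address 192.0.2.10; the alternative form is often called an iterative query.

```python
# Each simulated server holds some answers and knows which server handles
# which part of the name space below it.
SERVERS = {
    "cs.vu.nl":       {"answers": {}, "referrals": {"edu": "edu-server.net"}},
    "edu-server.net": {"answers": {}, "referrals": {"yale.edu": "yale.edu"}},
    "yale.edu":       {"answers": {}, "referrals": {"cs.yale.edu": "cs.yale.edu"}},
    "cs.yale.edu":    {"answers": {"linda.cs.yale.edu": "192.0.2.10"},
                       "referrals": {}},
}

def next_server(server, name):
    """Return the server that this one would pass the query to, or None."""
    for suffix, target in SERVERS[server]["referrals"].items():
        if name == suffix or name.endswith("." + suffix):
            return target
    return None

def recursive_query(server, name):
    """Recursive form: the server finds the answer itself and reports back."""
    answer = SERVERS[server]["answers"].get(name)
    if answer is not None:
        return answer
    target = next_server(server, name)
    return None if target is None else recursive_query(target, name)

def iterative_query(server, name):
    """Alternative form: answer if possible, else name the next server to try."""
    answer = SERVERS[server]["answers"].get(name)
    if answer is not None:
        return ("answer", answer)
    return ("referral", next_server(server, name))

def resolve_iteratively(first_server, name):
    """Client side of the alternative form: follow referrals one by one."""
    path = []
    server = first_server
    while server is not None:
        path.append(server)
        kind, value = iterative_query(server, name)
        if kind == "answer":
            return value, path
        server = value
    return None, path
```

In the recursive form the client sends one query and waits; in the alternative form the client itself contacts each of the four servers in turn. Both arrive at the same record.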


It is also worth pointing out that when a DNS client fails to get a response before its timer 
goes off, it normally will try another server next time. The assumption here is that the server 
is probably down, rather than that the request or reply got lost. 
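
This failover rule amounts to a loop over the known servers. In the sketch below, send_query is a hypothetical function supplied by the caller that either returns an answer or raises TimeoutError when its timer goes off.

```python
def query_with_failover(servers, name, send_query):
    """Ask each server in turn, moving on whenever one times out."""
    for server in servers:
        try:
            return send_query(server, name)
        except TimeoutError:
            # Assume this server is down and try the next one,
            # rather than repeating the query to the same server.
            continue
    return None  # no server answered in time
```

The function returns the first answer received, or None if every server timed out.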


While DNS is extremely important to the correct functioning of the Internet, all it really does is 
map symbolic names for machines onto their IP addresses. It does not help locate people, 
resources, services, or objects in general. For locating these things, another directory service 
has been defined, called LDAP (Lightweight Directory Access Protocol). It is a simplified 
version of the OSI X.500 directory service and is described in RFC 2251. It organizes 
information as a tree and allows searches on different components. It can be regarded as a 
"white pages" telephone book. We will not discuss it further in this book, but for more 
information see (Weltman and Dahbura, 2000). 


7.2 Electronic Mail 


Electronic mail, or e-mail, as it is known to its many fans, has been around for over two 
decades. Before 1990, it was mostly used in academia. During the 1990s, it became known to 
the public at large and grew exponentially to the point where the number of e-mails sent per 
day now is vastly more than the number of snail mail (i.e., paper) letters. 


E-mail, like most other forms of communication, has its own conventions and styles. In 
particular, it is very informal and has a low threshold of use. People who would never dream of 
calling up or even writing a letter to a Very Important Person do not hesitate for a second to 
send a sloppily-written e-mail. 


E-mail is full of jargon such as BTW (By The Way), ROTFL (Rolling On The Floor Laughing), and 
IMHO (In My Humble Opinion). Many people also use little ASCII symbols called smileys or 
emoticons in their e-mail. A few of the more interesting ones are reproduced in Fig. 7-6. For 
most, rotating the book 90 degrees clockwise will make them clearer. For a minibook giving 
over 650 smileys, see (Sanderson and Dougherty, 1993). 


Figure 7-6. Some smileys. They will not be on the final exam :-) 


Smiley  Meaning          Smiley  Meaning          Smiley  Meaning 
:-)     I'm happy        =|:-)   Abe Lincoln      :+)     Big nose 
:-(     I'm sad/angry    =):-)   Uncle Sam        :-))    Double chin 
:-|     I'm apathetic    *<:-)   Santa Claus      :-{)    Mustache 
;-)     I'm winking      <:-(    Dunce            #:-)    Matted hair 
:-(O)   I'm yelling      (-:     Australian       8-)     Wears glasses 
:-(*)   I'm vomiting     :-)X    Man with bowtie  C:-)    Large brain 


The first e-mail systems simply consisted of file transfer protocols, with the convention that 
the first line of each message (i.e., file) contained the recipient's address. As time went on, the 
limitations of this approach became more obvious. 


Some of the complaints were as follows: 


1. Sending a message to a group of people was inconvenient. Managers often need this 
facility to send memos to all their subordinates. 

2. Messages had no internal structure, making computer processing difficult. For example, 
if a forwarded message was included in the body of another message, extracting the 
forwarded part from the received message was difficult. 

3. The originator (sender) never knew if a message arrived or not. 

4. If someone was planning to be away on business for several weeks and wanted all 
incoming e-mail to be handled by his secretary, this was not easy to arrange. 

5. The user interface was poorly integrated with the transmission system, requiring users 
first to edit a file, then leave the editor and invoke the file transfer program. 

6. It was not possible to create and send messages containing a mixture of text, drawings, 
facsimile, and voice. 


As experience was gained, more elaborate e-mail systems were proposed. In 1982, the 
ARPANET e-mail proposals were published as RFC 821 (transmission protocol) and RFC 822 
(message format). Minor revisions, RFC 2821 and RFC 2822, have become Internet standards, 
but everyone still refers to Internet e-mail as RFC 822. 


In 1984, CCITT drafted its X.400 recommendation. After two decades of competition, e-mail 
systems based on RFC 822 are widely used, whereas those based on X.400 have disappeared. 
How a system hacked together by a handful of computer science graduate students beat an 
official international standard strongly backed by all the PTTs in the world, many governments, 
and a substantial part of the computer industry brings to mind the Biblical story of David and 
Goliath. 


The reason for RFC 822's success is not that it is so good, but that X.400 was so poorly 
designed and so complex that nobody could implement it well. Given a choice between a 
simple-minded, but working, RFC 822-based e-mail system and a supposedly truly wonderful, 
but nonworking, X.400 e-mail system, most organizations chose the former. Perhaps there is a 
lesson lurking in there somewhere. Consequently, our discussion of e-mail will focus on the 
Internet e-mail system. 


7.2.1 Architecture and Services 


In this section we will provide an overview of what e-mail systems can do and how they are 
organized. They normally consist of two subsystems: the user agents, which allow people to 
read and send e-mail, and the message transfer agents, which move the messages from 
the source to the destination. The user agents are local programs that provide a command- 
based, menu-based, or graphical method for interacting with the e-mail system. The message 
transfer agents are typically system daemons, that is, processes that run in the background. 
Their job is to move e-mail through the system. 


Typically, e-mail systems support five basic functions. Let us take a look at them. 


Composition refers to the process of creating messages and answers. Although any text 
editor can be used for the body of the message, the system itself can provide assistance with 
addressing and the numerous header fields attached to each message. For example, when 
answering a message, the e-mail system can extract the originator's address from the 
incoming e-mail and automatically insert it into the proper place in the reply. 
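
The reply example can be sketched with Python's standard email package. This is only an illustration of the idea; the function name make_reply is our own, and the sketch assumes the incoming message carries From, To, and Subject header fields.

```python
from email.message import EmailMessage

def make_reply(original, body):
    """Build a reply whose header fields are filled in from the original."""
    reply = EmailMessage()
    # The originator of the incoming message becomes the recipient of the reply.
    reply["To"] = str(original["From"])
    reply["From"] = str(original["To"])
    subject = str(original["Subject"] or "")
    # Prefix the subject with "Re: " unless it is already a reply.
    if subject.lower().startswith("re:"):
        reply["Subject"] = subject
    else:
        reply["Subject"] = "Re: " + subject
    reply.set_content(body)
    return reply
```

The user supplies only the body; the addressing is extracted from the incoming message automatically.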


Transfer refers to moving messages from the originator to the recipient. In large part, this 
requires establishing a connection to the destination or some intermediate machine, outputting 
the message, and releasing the connection. The e-mail system should do this automatically, 
without bothering the user. 


Reporting has to do with telling the originator what happened to the message. Was it 
delivered? Was it rejected? Was it lost? Numerous applications exist in which confirmation of 
delivery is important and may even have legal significance ("Well, Your Honor, my e-mail 
system is not very reliable, so I guess the electronic subpoena just got lost somewhere"). 


Displaying incoming messages is needed so people can read their e-mail. Sometimes 
conversion is required or a special viewer must be invoked, for example, if the message is a 
PostScript file or digitized voice. Simple conversions and formatting are sometimes attempted 
as well. 


Disposition is the final step and concerns what the recipient does with the message after 
receiving it. Possibilities include throwing it away before reading, throwing it away after 
reading, saving it, and so on. It should also be possible to retrieve and reread saved 
messages, forward them, or process them in other ways. 


In addition to these basic services, some e-mail systems, especially internal corporate ones, 
provide a variety of advanced features. Let us just briefly mention a few of these. When people 
move or when they are away for some period of time, they may want their e-mail forwarded, 
so the system should be able to do this automatically. 


Most systems allow users to create mailboxes to store incoming e-mail. Commands are 
needed to create and destroy mailboxes, inspect the contents of mailboxes, insert and delete 
messages from mailboxes, and so on. 


Corporate managers often need to send a message to each of their subordinates, customers, 
or suppliers. This gives rise to the idea of a mailing list, which is a list of e-mail addresses. 
When a message is sent to the mailing list, identical copies are delivered to everyone on the 
list. 
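
Mailing list expansion can be sketched as a simple substitution of list names by their members. The function name and the dictionary of lists are hypothetical, and the sketch does not handle lists that contain other lists.

```python
def expand_recipients(recipients, mailing_lists):
    """Replace each mailing list name by its members, without duplicates.

    mailing_lists maps a list address to the addresses on that list;
    any recipient that is not a list name is kept as an ordinary address.
    """
    expanded = []
    for address in recipients:
        for member in mailing_lists.get(address, [address]):
            if member not in expanded:
                expanded.append(member)
    return expanded
```

A message sent to a list address and to one extra person is thus delivered once to each list member and once to the extra person, even if that person is also on the list.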


Other advanced features are carbon copies, blind carbon copies, high-priority e-mail, secret 
(i.e., encrypted) e-mail, alternative recipients if the primary one is not currently available, and 
the ability for secretaries to read and answer their bosses' e-mail. 


E-mail is now widely used within industry for intracompany communication. It allows far-flung 
employees to cooperate on complex projects, even over many time zones. By eliminating most 
cues associated with rank, age, and gender, e-mail debates tend to focus on ideas, not on 
corporate status. With e-mail, a brilliant idea from a summer student can have more impact 
than a dumb one from an executive vice president. 


A key idea in e-mail systems is the distinction between the envelope and its contents. The 
envelope encapsulates the message. It contains all the information needed for transporting the 
message, such as the destination address, priority, and security level, all of which are distinct 
from the message itself. The message transport agents use the envelope for routing, just as 
the post office does. 


The message inside the envelope consists of two parts: the header and the body. The header 
contains control information for the user agents. The body is entirely for the human recipient. 
Envelopes and messages are illustrated in Fig. 7-7. 
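
The separation between envelope and message can be sketched in code. The Envelope class below is our own construction, with fields modeled on the examples named above (destination address, priority, security level), and the e-mail addresses are hypothetical; the message itself is built with Python's standard email package.

```python
from dataclasses import dataclass
from email.message import EmailMessage

@dataclass
class Envelope:
    """Transport information, kept apart from the message it carries."""
    recipient: str
    priority: str = "normal"
    encryption: str = "none"

def build_mail():
    """Return an (envelope, message) pair for the invoice example."""
    # Header: control information for the user agents.
    message = EmailMessage()
    message["From"] = "United Gizmo <billing@example.com>"
    message["To"] = "Daniel Dumkopf <dumkopf@example.com>"
    message["Subject"] = "Invoice 1081"
    # Body: meant entirely for the human recipient.
    message.set_content("Please send us a check for $0.00 promptly.")
    # Envelope: what the message transfer agents look at when routing.
    envelope = Envelope(recipient="dumkopf@example.com", priority="urgent")
    return envelope, message
```

Note that the priority lives only in the envelope; nothing in the header or body has to be examined to route the message.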


Figure 7-7. Envelopes and messages. (a) Paper mail. (b) Electronic mail. 


(a) Paper mail 

Envelope:   Mr. Daniel Dumkopf 
            18 Willow Lane 
            White Plains, NY 10604 

Message:    United Gizmo 
            180 Main St 
            Boston, MA 02120 

            Sept. 1, 2002 
            Subject: Invoice 1081 

            Dear Mr. Dumkopf, 
            Our computer records show that you still have 
            not paid the above invoice of $0.00. Please send us a 
            check for $0.00 promptly. 

            Yours truly 
            United Gizmo 

(b) Electronic mail 

Envelope:   Name: Mr. Daniel Dumkopf 
            Street: 18 Willow Lane 
            City: White Plains 
            Zip code: 10604 
            Priority: Urgent 
            Encryption: None 

Header:     From: United Gizmo 
            Address: 180 Main St. 
            Location: Boston, MA 02120 
            Date: Sept. 1, 2002 
            Subject: Invoice 1081 

Body:       Dear Mr. Dumkopf, 
            Our computer records show that you still have 
            not paid the above invoice of $0.00. Please send us a 
            check for $0.00 promptly. 

            Yours truly 
            United Gizmo 


7.2.2 The User Agent 


E-mail systems have two basic parts, as we have seen: the user agents and the message 
transfer agents. In this section we will look at the user agents. A user agent is normally a 
program (sometimes called a mail reader) that accepts a variety of commands for composing, 
receiving, and replying to messages, as well as for manipulating mailboxes. Some user agents 
have a fancy menu- or icon-driven interface that requires a mouse, whereas others expect 
1-character commands from the keyboard. Functionally, these are the same. Some systems are 
menu- or icon-driven but also have keyboard shortcuts. 


Sending E-mail 


To send an e-mail message, a user must provide the message, the destination address, and 
possibly some other parameters. The message can be produced with a free-standing text 
editor, a word processing program, or possibly with a specialized text editor built into the user 
agent. The destination address must be in a format that the user agent can deal with. Many 
user agents expect addresses of the form user@dns-address. Since we have studied DNS 
earlier in this chapter, we will not repeat that material here. 


However, it is worth noting that other forms of addressing exist. In particular, X.400 addresses 
look radically different from DNS addresses. They are composed of attribute = value pairs 
separated by slashes, for example, 


/C=US/ST=MASSACHUSETTS/L=CAMBRIDGE/PA=360 MEMORIAL DR./CN=KEN SMITH/ 


This address specifies a country, state, locality, personal address and a common name (Ken 
Smith). Many other attributes are possible, so you can send e-mail to someone whose exact 
e-mail address you do not know, provided you know enough other attributes (e.g., company and 
job title). Although X.400 names are considerably less convenient than DNS names, most 
e-mail systems have aliases (sometimes called nicknames) that allow users to enter or select a 
person's name and get the correct e-mail address. Consequently, even with X.400 addresses, 
it is usually not necessary to actually type in these strange strings. 


Most e-mail systems support mailing lists, so that a user can send the same message to a list 
of people with a single command. If the mailing list is maintained locally, the user agent can 
just send a separate message to each intended recipient. However, if the list is maintained 
remotely, then messages will be expanded there. For example, if a group of bird watchers has 
a mailing list called birders installed on meadowlark.arizona.edu, then any message sent to 
birders@meadowlark.arizona.edu will be routed to the University of Arizona and expanded 
there into individual messages to all the mailing list members, wherever in the world they may 
be. Users of this mailing list cannot tell that it is a mailing list. It could just as well be the 
personal mailbox of Prof. Gabriel O. Birders. 
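
The local case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LOCAL_LISTS table, the expand_recipients function, and the member addresses are hypothetical and do not come from any real user agent. 

```python
# Hypothetical mailing lists maintained locally by the user agent.
LOCAL_LISTS = {
    "birders": ["ann@example.org", "bob@example.net"],
}


def expand_recipients(addresses, local_lists=LOCAL_LISTS):
    """Replace each locally maintained list name by its members.

    Anything that is not a local list name, including a remote list such as
    birders@meadowlark.arizona.edu, is passed through unchanged, because a
    remote list is expanded at the remote site, not here.
    """
    expanded = []
    for address in addresses:
        for member in local_lists.get(address, [address]):
            if member not in expanded:  # do not send duplicate copies
                expanded.append(member)
    return expanded


print(expand_recipients(["birders", "birders@meadowlark.arizona.edu"]))
```

The remote list address is treated exactly like a personal mailbox, which matches the observation that the sender cannot tell the difference. 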


Reading E-mail 


Typically, when a user agent is started up, it looks at the user's mailbox for incoming e-mail 
before displaying anything on the screen. Then it may announce the number of messages in 
the mailbox or display a one-line summary of each one and wait for a command. 


As an example of how a user agent works, let us take a look at a typical mail scenario. After 
starting up the user agent, the user asks for a summary of his e-mail. A display like that of 
Fig. 7-8 then appears on the screen. Each line refers to one message. In this example, the 
mailbox contains eight messages. 


Figure 7-8. An example display of the contents of a mailbox. 


#  Flags  Bytes   Sender       Subject 
1  K      1030    asw          Changes to MINIX 
2  KA     6348    trudy        Not all Trudys are nasty 
3  K F    4519    Amy N. Wong  Request for information 
4         1236    bal          Bioinformatics 
5         104110  kaashoek     Material on peer-to-peer 
6         1223    Frank        Re: Will you review a grant proposal 
7         3110    guido        Our paper has been accepted 
8         1204    dmr          Re: My student's visit 


Each line of the display contains several fields extracted from the envelope or header of the 
corresponding message. In a simple e-mail system, the choice of fields displayed is built into 
the program. In a more sophisticated system, the user can specify which fields are to be 
displayed by providing a user profile, a file describing the display format. In this basic 
example, the first field is the message number. The second field, Flags, can contain a K, 
meaning that the message is not new but was read previously and kept in the mailbox; an A, 
meaning that the message has already been answered; and/or an F, meaning that the 
message has been forwarded to someone else. Other flags are also possible. 


The third field tells how long the message is, and the fourth one tells who sent the message. 
Since this field is simply extracted from the message, it may contain first names, full 
names, initials, login names, or whatever else the sender chooses to put there. Finally, the 
Subject field gives a brief summary of what the message is about. People who fail to include a 
Subject field often discover that responses to their e-mail tend not to get the highest priority. 


After the headers have been displayed, the user can perform any of several actions, such as 
displaying a message, deleting a message, and so on. The older systems were text based and 
typically used one-character commands for performing these tasks, such as T (type message), 
A (answer message), D (delete message), and F (forward message). An argument specified 
the message in question. More recent systems use graphical interfaces. Usually, the user 
selects a message with the mouse and then clicks on an icon to type, answer, delete, or 
forward it. 


E-mail has come a long way from the days when it was just file transfer. Sophisticated user 
agents make managing a large volume of e-mail possible. For people who receive and send 
thousands of messages a year, such tools are invaluable. 


7.2.3 Message Formats 


Let us now turn from the user interface to the format of the e-mail messages themselves. First 
we will look at basic ASCII e-mail using RFC 822. After that, we will look at multimedia 
extensions to RFC 822. 


RFC 822 


Messages consist of a primitive envelope (described in RFC 821), some number of header 
fields, a blank line, and then the message body. Each header field (logically) consists of a 
single line of ASCII text containing the field name, a colon, and, for most fields, a value. RFC 
822 was designed decades ago and does not clearly distinguish the envelope fields from the 
header fields. Although it was revised in RFC 2822, completely redoing it was not possible due 
to its widespread usage. In normal usage, the user agent builds a message and passes it to 
the message transfer agent, which then uses some of the header fields to construct the actual 
envelope, a somewhat old-fashioned mixing of message and envelope. 
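
The layout just described (header fields, a blank line, then the body) can be seen by parsing a small message with Python's standard email package. This is a minimal sketch: the addresses are borrowed from the examples later in this section, while the subject and body text are made up for illustration. 

```python
from email import message_from_string

# Header fields, one per line, then a blank line, then the body.
RAW = (
    "From: elinor@abcd.com\n"
    "To: carolyn@xyz.com\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free at noon?\n"
)

msg = message_from_string(RAW)
header_fields = msg.items()   # list of (field name, value) pairs
body = msg.get_payload()      # everything after the first blank line

for name, value in header_fields:
    print(f"{name}: {value}")
print(repr(body))
```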


The principal header fields related to message transport are listed in Fig. 7-9. The To: field 
gives the DNS address of the primary recipient. Having multiple recipients is also allowed. The 
Cc: field gives the addresses of any secondary recipients. In terms of delivery, there is no 
distinction between the primary and secondary recipients. It is entirely a psychological 
difference that may be important to the people involved but is not important to the mail 
system. The term Cc: (Carbon copy) is a bit dated, since computers do not use carbon paper, 
but it is well established. The Bcc: (Blind carbon copy) field is like the Cc: field, except that 
this line is deleted from all the copies sent to the primary and secondary recipients. This 
feature allows people to send copies to third parties without the primary and secondary 
recipients knowing this. 
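
The handling of To:, Cc:, and Bcc: can be sketched as follows. The function name and the sample addresses are hypothetical, and for simplicity the sketch removes the Bcc: line from every copy, which is one possible policy rather than the only one. 

```python
from email import message_from_string
from email.utils import getaddresses


def recipients_and_visible_copy(raw_message):
    """Return (all recipient addresses, message text without the Bcc: line)."""
    msg = message_from_string(raw_message)
    fields = (
        msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    )
    # For delivery there is no distinction between the three kinds of recipient.
    recipients = [address for _name, address in getaddresses(fields)]
    del msg["Bcc"]  # deleting a header that is absent is harmless
    return recipients, msg.as_string()


raw = (
    "From: elinor@abcd.com\n"
    "To: carolyn@xyz.com\n"
    "Cc: dave@xyz.com\n"
    "Bcc: auditor@abcd.com\n"
    "\n"
    "Quarterly numbers attached.\n"
)
recipients, visible = recipients_and_visible_copy(raw)
print(recipients)
print(visible)
```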


Figure 7-9. RFC 822 header fields related to message transport. 


Header        Meaning 
To:           E-mail address(es) of primary recipient(s) 
Cc:           E-mail address(es) of secondary recipient(s) 
Bcc:          E-mail address(es) for blind carbon copies 
From:         Person or people who created the message 
Sender:       E-mail address of the actual sender 
Received:     Line added by each transfer agent along the route 
Return-Path:  Can be used to identify a path back to the sender 


The next two fields, From: and Sender:, tell who wrote and sent the message, respectively. 
These need not be the same. For example, a business executive may write a message, but her 
secretary may be the one who actually transmits it. In this case, the executive would be listed 
in the From: field and the secretary in the Sender: field. The From: field is required, but the 
Sender: field may be omitted if it is the same as the From: field. These fields are needed in 
case the message is undeliverable and must be returned to the sender. 


A line containing Received: is added by each message transfer agent along the way. The line 
contains the agent's identity, the date and time the message was received, and other 
information that can be used for finding bugs in the routing system. 


The Return-Path: field is added by the final message transfer agent and was intended to tell 
how to get back to the sender. In theory, this information can be gathered from all the 
Received: headers (except for the name of the sender's mailbox), but it is rarely filled in as 
such and typically just contains the sender's address. 


In addition to the fields of Fig. 7-9, RFC 822 messages may also contain a variety of header 
fields used by the user agents or human recipients. The most common ones are listed in Fig. 
7-10. Most of these are self-explanatory, so we will not go into all of them in detail. 


Figure 7-10. Some fields used in the RFC 822 message header. 


Header        Meaning 
Date:         The date and time the message was sent 
Reply-To:     E-mail address to which replies should be sent 
Message-Id:   Unique number for referencing this message later 
In-Reply-To:  Message-Id of the message to which this is a reply 
References:   Other relevant Message-Ids 
Keywords:     User-chosen keywords 
Subject:      Short summary of the message for the one-line display 


The Reply-To: field is sometimes used when neither the person composing the message nor 
the person sending the message wants to see the reply. For example, a marketing manager 
writes an e-mail message telling customers about a new product. The message is sent by a 
secretary, but the Reply-To: field lists the head of the sales department, who can answer 
questions and take orders. This field is also useful when the sender has two e-mail accounts 
and wants the reply to go to the other one. 
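
A user agent's choice of where to send a reply can be sketched as below. The function name and the example.com addresses are hypothetical; the rule shown (use Reply-To: when present, otherwise From:) is a simplification that ignores Sender: and messages with several authors. 

```python
from email import message_from_string
from email.utils import parseaddr


def reply_address(raw_message):
    """Pick the address a reply should go to: Reply-To: if present, else From:."""
    msg = message_from_string(raw_message)
    field = msg.get("Reply-To") or msg.get("From", "")
    return parseaddr(field)[1]  # keep only the address, drop any display name


announcement = (
    "From: Marketing Manager <marketing@example.com>\n"
    "Sender: secretary@example.com\n"
    "Reply-To: Head of Sales <sales@example.com>\n"
    "Subject: New product\n"
    "\n"
    "Our new product is available now.\n"
)
print(reply_address(announcement))
```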


The RFC 822 document explicitly says that users are allowed to invent new headers for their 
own private use, provided that these headers start with the string X-. It is guaranteed that no 
future headers will use names starting with X-, to avoid conflicts between official and private 
headers. Sometimes wiseguy undergraduates make up fields like X-Fruit-of-the-Day: or X- 
Disease-of-the-Week:, which are legal, although not always illuminating. 


After the headers comes the message body. Users can put whatever they want here. Some 
people terminate their messages with elaborate signatures, including simple ASCII cartoons, 
quotations from greater and lesser authorities, political statements, and disclaimers of all kinds 
(e.g., The XYZ Corporation is not responsible for my opinions; in fact, it cannot even 
comprehend them). 


MIME—The Multipurpose Internet Mail Extensions 


In the early days of the ARPANET, e-mail consisted exclusively of text messages written in 
English and expressed in ASCII. For this environment, RFC 822 did the job completely: it 
specified the headers but left the content entirely up to the users. Nowadays, on the worldwide 
Internet, this approach is no longer adequate. The problems include sending and receiving 


1. Messages in languages with accents (e.g., French and German). 
2. Messages in non-Latin alphabets (e.g., Hebrew and Russian). 
3. Messages in languages without alphabets (e.g., Chinese and Japanese). 
4. Messages not containing text at all (e.g., audio or images). 


A solution was proposed in RFC 1341 and updated in RFCs 2045-2049. This solution, called 
MIME (Multipurpose Internet Mail Extensions), is now widely used. We will now describe 
it. For additional information about MIME, see the RFCs. 


The basic idea of MIME is to continue to use the RFC 822 format, but to add structure to the 
message body and define encoding rules for non-ASCII messages. By not deviating from RFC 
822, MIME messages can be sent using the existing mail programs and protocols. All that has 
to be changed are the sending and receiving programs, which users can do for themselves. 


MIME defines five new message headers, as shown in Fig. 7-11. The first of these simply tells 
the user agent receiving the message that it is dealing with a MIME message, and which 
version of MIME it uses. Any message not containing a MIME-Version: header is assumed to be 
an English plaintext message and is processed as such. 


Figure 7-11. RFC 822 headers added by MIME. 


Header                      Meaning 
MIME-Version:               Identifies the MIME version 
Content-Description:        Human-readable string telling what is in the message 
Content-Id:                 Unique identifier 
Content-Transfer-Encoding:  How the body is wrapped for transmission 
Content-Type:               Type and format of the content 


The Content-Description: header is an ASCII string telling what is in the message. This header 
is needed so the recipient will know whether it is worth decoding and reading the message. If 
the string says: "Photo of Barbara's hamster" and the person getting the message is not a big 
hamster fan, the message will probably be discarded rather than decoded into a 
high-resolution color photograph. 


The Content-Id: header identifies the content. It uses the same format as the standard 
Message-Id: header. 


The Content-Transfer-Encoding: header tells how the body is wrapped for transmission through a 
network that may object to most characters other than letters, numbers, and punctuation 
marks. Five schemes (plus an escape to new schemes) are provided. The simplest scheme is 
just ASCII text. ASCII characters use 7 bits and can be carried directly by the e-mail protocol 
provided that no line exceeds 1000 characters. 


The next simplest scheme is the same thing, but using 8-bit characters, that is, all values from 
0 up to and including 255. This encoding scheme violates the (original) Internet e-mail 
protocol but is used by some parts of the Internet that implement some extensions to the 
original protocol. While declaring the encoding does not make it legal, having it explicit may at 
least explain things when something goes wrong. Messages using the 8-bit encoding must still 
adhere to the standard maximum line length. 


Even worse are messages that use binary encoding. These are arbitrary binary files that not 
only use all 8 bits but also do not even respect the 1000-character line limit. Executable 
programs fall into this category. No guarantee is given that messages in binary will arrive 
correctly, but some people try anyway. 
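
The difference between the three unencoded schemes can be made concrete with a small classifier. This is a rough sketch, not a validator: the function name is hypothetical, the 1000-character limit is the figure quoted above, and other restrictions on 7-bit data are ignored. 

```python
MAX_LINE = 1000  # line-length limit quoted in the text


def classify_body(data: bytes) -> str:
    """Return "7bit", "8bit", or "binary" for an unencoded message body."""
    if any(len(line) > MAX_LINE for line in data.split(b"\n")):
        return "binary"  # the line limit is violated, whatever the byte values
    if any(byte > 127 for byte in data):
        return "8bit"    # short lines, but some bytes need all 8 bits
    return "7bit"        # plain ASCII with short lines


print(classify_body(b"Hello, world\n"))
print(classify_body("caf\u00e9\n".encode("latin-1")))
print(classify_body(bytes(5000)))
```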


The correct way to encode binary messages is to use base64 encoding, sometimes called 
ASCII armor. In this scheme, groups of 24 bits are broken up into four 6-bit units, with each 
unit being sent as a legal ASCII character. The coding is "A" for 0, "B" for 1, and so on, 
followed by the 26 lower-case letters, the ten digits, and finally + and / for 62 and 63, 
respectively. The == and = sequences indicate that the last group contained only 8 or 16 bits, 
respectively. Carriage returns and line feeds are ignored, so they can be inserted at will to 
keep the lines short enough. Arbitrary binary text can be sent safely using this scheme. 
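
The 24-bit grouping can be written out directly. The sketch below encodes one group at a time and compares the result with Python's standard base64 module; the helper names encode_group and b64 are our own, and line breaking is left out. 

```python
import base64
import string

# "A" for 0, "B" for 1, ..., then lower-case letters, digits, "+" and "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(group: bytes) -> str:
    """Encode one group of 1 to 3 bytes as four base64 characters."""
    padding = 3 - len(group)
    value = int.from_bytes(group + b"\x00" * padding, "big")  # 24 bits
    chars = [ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
    if padding:
        chars[-padding:] = "=" * padding  # "=" for 16 bits, "==" for 8 bits
    return "".join(chars)


def b64(data: bytes) -> str:
    """Encode arbitrary bytes by handling 24 bits (3 bytes) at a time."""
    return "".join(encode_group(data[i:i + 3]) for i in range(0, len(data), 3))


for raw in (b"abc", b"ab", b"a"):
    assert b64(raw) == base64.b64encode(raw).decode("ascii")
    print(len(raw) * 8, "bits ->", b64(raw))
# prints: 24 bits -> YWJj, 16 bits -> YWI=, 8 bits -> YQ==
```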


For messages that are almost entirely ASCII but with a few non-ASCII characters, base64 
encoding is somewhat inefficient. Instead, an encoding known as quoted-printable encoding 
is used. This is just 7-bit ASCII, with all the characters above 127 encoded as an equal sign 
followed by the character's value as two hexadecimal digits. 
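
A short experiment with Python's standard quopri module shows the effect. The sample text and the choice of the Latin-1 character set (in which the accented letters occupy a single byte each) are assumptions made for the illustration. 

```python
import quopri

text = "caf\u00e9 cr\u00e8me"        # two accented letters, the rest is ASCII
raw = text.encode("latin-1")         # each accented letter becomes one byte > 127

encoded = quopri.encodestring(raw)   # ASCII stays as is, other bytes become =XX
decoded = quopri.decodestring(encoded)

print(encoded)
print(decoded.decode("latin-1"))
```

Only the two accented characters grow to three characters each, so the message stays nearly as short, and as readable, as the original. 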


In summary, binary data should be sent encoded in base64 or quoted-printable form. When 
there are valid reasons not to use one of these schemes, it is possible to specify a user-defined 
encoding in the Content-Transfer-Encoding: header. 


The last header shown in Fig. 7-11 is really the most interesting one. It specifies the nature of 
the message body. Seven types are defined in RFC 2045, each of which has one or more 
subtypes. The type and subtype are separated by a slash, as in 


Content-Type: video/mpeg 


The subtype must be given explicitly in the header; no defaults are provided. The initial list of 
types and subtypes specified in RFC 2045 is given in Fig. 7-12. Many new ones have been 
added since then, and additional entries are being added all the time as the need arises. 
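
Splitting a Content-Type: value into its type and subtype is straightforward. The function below is a minimal sketch with a name of our own choosing; it ignores any parameters after a semicolon and rejects a value without a subtype, since no defaults are provided. 

```python
def split_content_type(value: str) -> tuple[str, str]:
    """Return (type, subtype) from a Content-Type: value such as "video/mpeg"."""
    media_type = value.split(";", 1)[0].strip().lower()  # drop parameters
    main, slash, sub = media_type.partition("/")
    if not (main and slash and sub):
        raise ValueError(f"type and subtype are both required: {value!r}")
    return main, sub


print(split_content_type("video/mpeg"))
print(split_content_type("Text/Plain; charset=us-ascii"))
```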


Figure 7-12. The MIME types and subtypes defined in RFC 2045. 


Type         Subtype        Description 
Text         Plain          Unformatted text 
             Enriched       Text including simple formatting commands 
Image        Gif            Still picture in GIF format 
             Jpeg           Still picture in JPEG format 
Audio        Basic          Audible sound 
Video        Mpeg           Movie in MPEG format 
Application  Octet-stream   An uninterpreted byte sequence 
             Postscript     A printable document in PostScript 
Message      Rfc822         A MIME RFC 822 message 
             Partial        Message has been split for transmission 
             External-body  Message itself must be fetched over the net 
Multipart    Mixed          Independent parts in the specified order 
             Alternative    Same message in different formats 
             Parallel       Parts must be viewed simultaneously 
             Digest         Each part is a complete RFC 822 message 


Let us now go briefly through the list of types. The text type is for straight ASCII text. The 
text/plain combination is for ordinary messages that can be displayed as received, with no 
encoding and no further processing. This option allows ordinary messages to be transported in 
MIME with only a few extra headers. 


The text/enriched subtype allows a simple markup language to be included in the text. This 
language provides a system-independent way to express boldface, italics, smaller and larger 
point sizes, indentation, justification, sub- and superscripting, and simple page layout. The 
markup language is based on SGML, the Standard Generalized Markup Language also used as 
the basis for the World Wide Web's HTML. For example, the message 


The <bold> time </bold> has come the <italic> walrus </italic> said ... 


would be displayed as 
The time has come the walrus said ... 
with the word time in boldface and the word walrus in italics. 


It is up to the receiving system to choose the appropriate rendition. If boldface and italics are 
available, they can be used; otherwise, colors, blinking, underlining, reverse video, etc., can 
be used for emphasis. Different systems can, and do, make different choices. 


When the Web became popular, a new subtype text/html was added (in RFC 2854) to allow 
Web pages to be sent in RFC 822 e-mail. A subtype for the extensible markup language, 
text/xml, is defined in RFC 3023. We will study HTML and XML later in this chapter. 


The next MIME type is image, which is used to transmit still pictures. Many formats are widely 
used for storing and transmitting images nowadays, both with and without compression. Two 
of these, GIF and JPEG, are built into nearly all browsers, but many others exist as well and 
have been added to the original list. 


The audio and video types are for sound and moving pictures, respectively. Please note that 
video includes only the visual information, not the soundtrack. If a movie with sound is to be 
transmitted, the video and audio portions may have to be transmitted separately, depending 
on the encoding system used. The first video format defined was the one devised by the 
modestly-named Moving Picture Experts Group (MPEG), but others have been added since. In 
addition to audio/basic, a new audio type, audio/mpeg was added in RFC 3003 to allow people 
to e-mail MP3 audio files. 


The application type is a catchall for formats that require external processing not covered by 
one of the other types. An octet-stream is just a sequence of uninterpreted bytes. Upon 
receiving such a stream, a user agent should probably display it by suggesting to the user that 
it be copied to a file and prompting for a file name. Subsequent processing is then up to the 
user. 


The other defined subtype is postscript, which refers to the PostScript language defined by 
Adobe Systems and widely used for describing printed pages. Many printers have built-in 
PostScript interpreters. Although a user agent can just call an external PostScript interpreter to 
display incoming PostScript files, doing so is not without some danger. PostScript is a 
full-blown programming language. Given enough time, a sufficiently masochistic person could write 
a C compiler or a database management system in PostScript. Displaying an incoming 
PostScript message is done by executing the PostScript program contained in it. In addition to 
displaying some text, this program can read, modify, or delete the user's files, and have other 
nasty side effects. 


The message type allows one message to be fully encapsulated inside another. This scheme is 
useful for forwarding e-mail, for example. When a complete RFC 822 message is encapsulated 
inside an outer message, the rfc822 subtype should be used. 


The partial subtype makes it possible to break an encapsulated message into pieces and send 
them separately (for example, if the encapsulated message is too long). Parameters make it 
possible to reassemble all the parts at the destination in the correct order. 


Finally, the external-body subtype can be used for very long messages (e.g., video films). 
Instead of including the MPEG file in the message, an FTP address is given and the receiver's 
user agent can fetch it over the network at the time it is needed. This facility is especially 
useful when sending a movie to a mailing list of people, only a few of whom are expected to 
view it (think about electronic junk mail containing advertising videos). 


The final type is multipart, which allows a message to contain more than one part, with the 
beginning and end of each part being clearly delimited. The mixed subtype allows each part to 
be different, with no additional structure imposed. Many e-mail programs allow the user to 
provide one or more attachments to a text message. These attachments are sent using the 
multipart type. 


In contrast to mixed, the alternative subtype allows the same message to be included 
multiple times but expressed in two or more different media. For example, a message could be 
sent in plain ASCII, in enriched text, and in PostScript. A properly-designed user agent getting 
such a message would display it in PostScript if possible. Second choice would be enriched 
text. If neither of these were possible, the flat ASCII text would be displayed. The parts should 
be ordered from simplest to most complex to help recipients with pre-MIME user agents make 
some sense of the message (e.g., even a pre-MIME user can read flat ASCII text). 
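
Because the parts are ordered from simplest to most complex, a user agent can pick the one to display by scanning from the end. The following is a minimal sketch of that rule; the function name and the set of displayable types are hypothetical. 

```python
def choose_alternative(part_types, displayable):
    """Return the index of the last part the agent can display, or None.

    part_types lists the Content-Type of each part, simplest first, so the
    last displayable part is the richest rendition available.
    """
    for index in range(len(part_types) - 1, -1, -1):
        if part_types[index] in displayable:
            return index
    return None


parts = ["text/plain", "text/enriched", "application/postscript"]
print(choose_alternative(parts, {"text/plain", "text/enriched"}))
print(choose_alternative(parts, {"text/plain"}))
```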


The alternative subtype can also be used for multiple languages. In this context, the Rosetta 
Stone can be thought of as an early multipart/alternative message. 


A multimedia example is shown in Fig. 7-13. Here a birthday greeting is transmitted both as 
text and as a song. If the receiver has an audio capability, the user agent there will fetch the 
sound file, birthday.snd, and play it. If not, the lyrics are displayed on the screen in stony 
silence. The parts are delimited by two hyphens followed by a (software-generated) string 
specified in the boundary parameter. 


Figure 7-13. A multipart message containing enriched and audio alternatives. 


From: elinor@abcd.com 
To: carolyn@xyz.com 
MIME-Version: 1.0 
Message-Id: <0704760941.AA00747@abcd.com> 
Content-Type: multipart/alternative; boundary=qwertyuiopasdfghjklzxcvbnm 
Subject: Earth orbits sun integral number of times 

This is the preamble. The user agent ignores it. Have a nice day. 

--qwertyuiopasdfghjklzxcvbnm 
Content-Type: text/enriched 

Happy birthday to you 
Happy birthday to you 
Happy birthday dear <bold> Carolyn </bold> 
Happy birthday to you 

--qwertyuiopasdfghjklzxcvbnm 
Content-Type: message/external-body; 
    access-type="anon-ftp"; 
    site="bicycle.abcd.com"; 
    directory="pub"; 
    name="birthday.snd" 

content-type: audio/basic 
content-transfer-encoding: base64 
--qwertyuiopasdfghjklzxcvbnm-- 


Note that the Content-Type header occurs in three positions within this example. At the top 
level, it indicates that the message has multiple parts. Within each part, it gives the type and 
subtype of that part. Finally, within the body of the second part, it is required to tell the user 
agent what kind of an external file it is to fetch. To indicate this slight difference in usage, we 
have used lower case letters here, although all headers are case insensitive. The 
content-transfer-encoding is similarly required for any external body that is not encoded as 7-bit 
ASCII. 
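
The structure of such a message can be inspected with Python's standard email package. The sketch below feeds it the text of the birthday message and lists the type of each top-level part along with two parameters of the external body. It is an illustration of the layout only; no file is actually fetched. 

```python
from email import message_from_string

RAW = """\
From: elinor@abcd.com
To: carolyn@xyz.com
MIME-Version: 1.0
Message-Id: <0704760941.AA00747@abcd.com>
Content-Type: multipart/alternative; boundary=qwertyuiopasdfghjklzxcvbnm
Subject: Earth orbits sun integral number of times

This is the preamble. The user agent ignores it. Have a nice day.

--qwertyuiopasdfghjklzxcvbnm
Content-Type: text/enriched

Happy birthday to you
Happy birthday to you
Happy birthday dear <bold> Carolyn </bold>
Happy birthday to you

--qwertyuiopasdfghjklzxcvbnm
Content-Type: message/external-body;
    access-type="anon-ftp";
    site="bicycle.abcd.com";
    directory="pub";
    name="birthday.snd"

content-type: audio/basic
content-transfer-encoding: base64
--qwertyuiopasdfghjklzxcvbnm--
"""

msg = message_from_string(RAW)
parts = msg.get_payload()            # the parts found between the boundaries

print(msg.get_content_type(), "with boundary", msg.get_boundary())
for part in parts:
    print("  part:", part.get_content_type())
print("fetch from", parts[1].get_param("site"), "via", parts[1].get_param("access-type"))
```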


Getting back to the subtypes for multipart messages, two more possibilities exist. The parallel 
subtype is used when all parts must be "viewed" simultaneously. For example, movies often 
have an audio channel and a video channel. Movies are more effective if these two channels 
are played back in parallel, instead of consecutively. 


Finally, the digest subtype is used when many messages are packed together into a composite 
message. For example, some discussion groups on the Internet collect messages from 
subscribers and then send them out to the group as a single multipart/digest message. 


7.2.4 Message Transfer 


The message transfer system is concerned with relaying messages from the originator to the 
recipient. The simplest way to do this is to establish a transport connection from the source 
machine to the destination machine and then just transfer the message. After examining how 
this is normally done, we will examine some situations in which this does not work and what 
can be done about them. 


SMTP—The Simple Mail Transfer Protocol 


Within the Internet, e-mail is delivered by having the source machine establish a TCP 
connection to port 25 of the destination machine. Listening to this port is an e-mail daemon 
that speaks SMTP (Simple Mail Transfer Protocol). This daemon accepts incoming 
connections and copies messages from them into the appropriate mailboxes. If a message 
cannot be delivered, an error report containing the first part of the undeliverable message is 
returned to the sender. 


SMTP is a simple ASCII protocol. After establishing the TCP connection to port 25, the sending 
machine, operating as the client, waits for the receiving machine, operating as the server, to 
talk first. The server starts by sending a line of text giving its identity and telling whether it is 
prepared to receive mail. If it is not, the client releases the connection and tries again later. 


If the server is willing to accept e-mail, the client announces whom the e-mail is coming from 
and whom it is going to. If such a recipient exists at the destination, the server gives the client 
the go-ahead to send the message. Then the client sends the message and the server 
acknowledges it. No checksums are needed because TCP provides a reliable byte stream. If 
there is more e-mail, that is now sent. When all the e-mail has been exchanged in both 
directions, the connection is released. A sample dialog for sending the message of Fig. 7-13, 
including the numerical codes used by SMTP, is shown in Fig. 7-14. The lines sent by the client 
are marked C:. Those sent by the server are marked S:. 


Figure 7-14. Transferring a message from elinor@abcd.com to carolyn@xyz.com. 


S: 220 xyz.com SMTP service ready 
C: HELO abcd.com 
S: 250 xyz.com says hello to abcd.com 
C: MAIL FROM: <elinor@abcd.com> 
S: 250 sender ok 
C: RCPT TO: <carolyn@xyz.com> 
S: 250 recipient ok 
C: DATA 
S: 354 Send mail; end with "." on a line by itself 
C: From: elinor@abcd.com 
C: To: carolyn@xyz.com 
C: MIME-Version: 1.0 
C: Message-Id: <0704760941.AA00747@abcd.com> 
C: Content-Type: multipart/alternative; boundary=qwertyuiopasdfghjklzxcvbnm 
C: Subject: Earth orbits sun integral number of times 
C: 
C: This is the preamble. The user agent ignores it. Have a nice day. 
C: 
C: --qwertyuiopasdfghjklzxcvbnm 
C: Content-Type: text/enriched 
C: 
C: Happy birthday to you 
C: Happy birthday to you 
C: Happy birthday dear <bold> Carolyn </bold> 
C: Happy birthday to you 
C: 
C: --qwertyuiopasdfghjklzxcvbnm 
C: Content-Type: message/external-body; 
C:     access-type="anon-ftp"; 
C:     site="bicycle.abcd.com"; 
C:     directory="pub"; 
C:     name="birthday.snd" 
C: 
C: content-type: audio/basic 
C: content-transfer-encoding: base64 
C: --qwertyuiopasdfghjklzxcvbnm-- 
C: . 
S: 250 message accepted 
C: QUIT 
S: 221 xyz.com closing connection 


A few comments about Fig. 7-14 may be helpful. The first command from the client is indeed 
HELO. Of the various four-character abbreviations for HELLO, this one has numerous 
advantages over its biggest competitor. Why all the commands had to be four characters has 
been lost in the mists of time. 


In Fig. 7-14, the message is sent to only one recipient, so only one RCPT command is used. 
Multiple RCPT commands may be used to send a single message to multiple receivers. Each one is 
individually acknowledged or rejected. Even if some recipients are rejected (because they do 
not exist at the destination), the message can be sent to the other ones. 
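
The client's side of such a dialog, with one RCPT command per receiver, can be generated as follows. This is a simplified sketch: the function name and the second recipient are hypothetical, the server's replies are not examined, and message lines that themselves begin with a period would need extra handling that is omitted here. 

```python
def smtp_commands(client_domain, sender, recipients, message_lines):
    """List the lines a client sends, in order, to transfer one message."""
    lines = [f"HELO {client_domain}", f"MAIL FROM:<{sender}>"]
    lines += [f"RCPT TO:<{recipient}>" for recipient in recipients]  # one each
    lines.append("DATA")
    lines += message_lines      # header fields, a blank line, then the body
    lines.append(".")           # a line holding only a period ends the message
    lines.append("QUIT")
    return lines


commands = smtp_commands(
    "abcd.com",
    "elinor@abcd.com",
    ["carolyn@xyz.com", "dave@xyz.com"],
    ["Subject: Happy birthday", "", "Happy birthday to you"],
)
for line in commands:
    print("C:", line)
```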


Finally, although the syntax of the four-character commands from the client is rigidly specified, 
the syntax of the replies is less rigid. Only the numerical code really counts. Each 
implementation can put whatever string it wants after the code. 
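
Since only the numerical code counts, a client can separate it from the free-form text before deciding what to do. The helpers below are a minimal sketch with names of our own choosing; they handle single-line replies only. 

```python
def parse_reply(line: str) -> tuple[int, str]:
    """Split a server reply into its numerical code and its free-form text."""
    code, _space, text = line.strip().partition(" ")
    return int(code), text


def is_success(code: int) -> bool:
    """Treat the codes from 200 to 299 as meaning the command was accepted."""
    return 200 <= code < 300


for reply in ("250 sender ok", "250 Pleased to meet you", "354 Send mail"):
    code, text = parse_reply(reply)
    print(code, is_success(code), repr(text))
```

The first two replies differ in wording but carry the same code, so the client treats them identically. 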


To get a better feel for how SMTP and some of the other protocols described in this chapter 
work, try them out. In all cases, first go to a machine connected to the Internet. On a UNIX 
system, in a shell, type 


telnet mail.isp.com 25 


substituting the DNS name of your ISP's mail server for mail.isp.com. On a Windows system, 
click on Start, then Run, and type the command in the dialog box. This command will establish 
a telnet (i.e., TCP) connection to port 25 on that machine. Port 25 is the SMTP port (see 
Fig. 6-27 for some common ports). You will probably get a response something like this: 


Trying 192.30.200.9596.... 

Connected to mail.isp.com 

Escape character is '^]'. 

220 mail.isp.com Smail #74 ready at Thu, 25 Sept 2002 13:26 +0200 


The first three lines are from telnet telling you what it is doing. The last line is from the SMTP 
server on the remote machine announcing its willingness to talk to you and accept e-mail. To 
find out what commands it accepts, type 


HELP 


From this point on, a command sequence such as the one in Fig. 7-14 is possible, starting with 
the client's HELO command. 


It is worth noting that the use of lines of ASCII text for commands is not an accident. Most 
Internet protocols work this way. Using ASCII text makes the protocols easy to test and 
debug. They can be tested by sending commands manually, as we saw above, and dumps of 
the messages are easy to read. 


Even though the SMTP protocol is completely well defined, a few problems can still arise. One 
problem relates to message length. Some older implementations cannot handle messages 
exceeding 64 KB. Another problem relates to timeouts. If the client and server have different 
timeouts, one of them may give up while the other is still busy, unexpectedly terminating the 
connection. Finally, in rare situations, infinite mailstorms can be triggered. For example, if host 
1 holds mailing list A and host 2 holds mailing list B and each list contains an entry for the 
other one, then a message sent to either list could generate a never-ending amount of e-mail 
traffic unless somebody checks for it. 


To get around some of these problems, extended SMTP (ESMTP) has been defined in RFC 
2821. Clients wanting to use it should send an EHLO message instead of HELO initially. If this 
is rejected, then the server is a regular SMTP server, and the client should proceed in the usual 
way. If the EHLO is accepted, then new commands and parameters are allowed. 
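
The fallback from EHLO to HELO can be sketched as follows. This is an illustration only: send is a hypothetical callable that transmits one command line and returns the server's numeric reply code, and the codes 250 (accepted) and 500 (command not recognized) are assumed here as the success and rejection replies.

```python
from typing import Callable


def greet(send: Callable[[str], int], client_name: str) -> str:
    """Try EHLO first; fall back to HELO if the server rejects it."""
    if send(f"EHLO {client_name}") == 250:
        return "ESMTP"          # extended commands and parameters are allowed
    if send(f"HELO {client_name}") == 250:
        return "SMTP"           # regular server: proceed in the usual way
    raise RuntimeError("server rejected both EHLO and HELO")


def old_server(command: str) -> int:
    """Stand-in for a regular SMTP server that does not know EHLO."""
    return 250 if command.startswith("HELO ") else 500


print(greet(old_server, "abc.com"))   # SMTP
```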


7.2.5 Final Delivery 


Up until now, we have assumed that all users work on machines that are capable of sending 
and receiving e-mail. As we saw, e-mail is delivered by having the sender establish a TCP 
connection to the receiver and then ship the e-mail over it. This model worked fine for decades 
when all ARPANET (and later Internet) hosts were, in fact, on-line all the time to accept TCP 
connections. 


However, with the advent of people who access the Internet by calling their ISP over a 
modem, it breaks down. The problem is this: what happens when Elinor wants to send Carolyn 
e-mail and Carolyn is not currently on-line? Elinor cannot establish a TCP connection to Carolyn 
and thus cannot run the SMTP protocol. 


One solution is to have a message transfer agent on an ISP machine accept e-mail for its 
customers and store it in their mailboxes on an ISP machine. Since this agent can be on-line 
all the time, e-mail can be sent to it 24 hours a day. 


POP3 


Unfortunately, this solution creates another problem: how does the user get the e-mail from 
the ISP's message transfer agent? The solution to this problem is to create another protocol 
that allows user transfer agents (on client PCs) to contact the message transfer agent (on the 
ISP's machine) and allow e-mail to be copied from the ISP to the user. One such protocol is 
POP3 (Post Office Protocol Version 3), which is described in RFC 1939. 


The situation that used to hold (both sender and receiver having a permanent connection to 
the Internet) is illustrated in Fig. 7-15(a). A situation in which the sender is (currently) on-line 
but the receiver is not is illustrated in Fig. 7-15(b). 


Figure 7-15. (a) Sending and reading mail when the receiver has a 
permanent Internet connection and the user agent runs on the same 
machine as the message transfer agent. (b) Reading e-mail when the 
receiver has a dial-up connection to an ISP. 


POP3 begins when the user starts the mail reader. The mail reader calls up the ISP (unless
there is already a connection) and establishes a TCP connection with the message transfer 
agent at port 110. Once the connection has been established, the POP3 protocol goes through 
three states in sequence: 


1. Authorization. 
2. Transactions. 
3. Update. 


The authorization state deals with having the user log in. The transaction state deals with the 
user collecting the e-mails and marking them for deletion from the mailbox. The update state 
actually causes the e-mails to be deleted. 


This behavior can be observed by typing something like: 


telnet mail.isp.com 110 


where mail.isp.com represents the DNS name of your ISP's mail server. Telnet establishes a 
TCP connection to port 110, on which the POP3 server listens. Upon accepting the TCP 
connection, the server sends an ASCII message announcing that it is present. Usually, it
begins with +OK followed by a comment. An example scenario is shown in Fig. 7-16 starting
after the TCP connection has been established. As before, the lines marked C: are from the
client (user) and those marked S: are from the server (message transfer agent on the ISP's
machine).


Figure 7-16. Using POP3 to fetch three messages. 


S: +OK POP3 server ready
C: USER carolyn
S: +OK
C: PASS vegetables
S: +OK login successful
C: LIST
S: 1 2505
S: 2 14302
S: 3 8122
S: .
C: RETR 1
S: (sends message 1)
C: DELE 1
C: RETR 2
S: (sends message 2)
C: DELE 2
C: RETR 3
S: (sends message 3)
C: DELE 3
C: QUIT
S: +OK POP3 server disconnecting


During the authorization state, the client sends over its user name and then its password. After
a successful login, the client can then send over the LIST command, which causes the server to
list the contents of the mailbox, one message per line, giving the length of that message. The
list is terminated by a period.
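
A client has to read the listing up to the terminating period before it can decide what to retrieve. Here is a small sketch in Python, assuming the reply lines have already been read from the connection and stripped of the S: labels used in the figure. The name parse_list is hypothetical.

```python
def parse_list(lines):
    """Return {message number: length} from the server's reply to LIST."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if line == ".":              # a lone period terminates the listing
            break
        number, length = line.split()
        sizes[int(number)] = int(length)
    return sizes


print(parse_list(["1 2505", "2 14302", "3 8122", "."]))
```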


Then the client can retrieve messages using the RETR command and mark them for deletion 
with DELE. When all messages have been retrieved (and possibly marked for deletion), the 
client gives the QUIT command to terminate the transaction state and enter the update state. 
When the server has deleted all the messages, it sends a reply and breaks the TCP connection. 
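
The difference between marking a message in the transaction state and actually deleting it in the update state can be illustrated with a toy in-memory mailbox. This is only a sketch of the idea, not a POP3 server. The class and method names are invented for the example.

```python
class Mailbox:
    """Toy in-memory mailbox that mimics POP3's deferred deletion."""

    def __init__(self, messages):
        self.messages = dict(enumerate(messages, start=1))
        self.marked = set()

    def retr(self, n):
        return self.messages[n]

    def dele(self, n):
        # Transaction state: only mark the message; nothing is removed yet.
        if n not in self.messages:
            raise KeyError(n)
        self.marked.add(n)

    def quit(self):
        # Update state: the marked messages are actually deleted now.
        for n in self.marked:
            del self.messages[n]
        self.marked.clear()


box = Mailbox(["message one", "message two", "message three"])
box.retr(1)
box.dele(1)
print(len(box.messages))   # 3, because deletion is deferred until QUIT
box.quit()
print(len(box.messages))   # 2
```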


While it is true that the POP3 protocol supports the ability to download a specific message or 
set of messages and leave them on the server, most e-mail programs just download 
everything and empty the mailbox. This behavior means that in practice, the only copy is on 
the user's hard disk. If that crashes, all e-mail may be lost permanently. 


Let us now briefly summarize how e-mail works for ISP customers. Elinor creates a message
for Carolyn using some e-mail program (i.e., user agent) and clicks on an icon to send it. The
e-mail program hands the message over to the message transfer agent on Elinor's host. The 
message transfer agent sees that it is directed to carolyn@xyz.com so it uses DNS to look up 
the MX record for xyz.com (where xyz.com is Carolyn's ISP). This query returns the DNS name 
of xyz.com's mail server. The message transfer agent now looks up the IP address of this 
machine using DNS again, for example, using gethostbyname. It then establishes a TCP 
connection to the SMTP server on port 25 of this machine. Using an SMTP command sequence 
analogous to that of Fig. 7-14, it transfers the message to Carolyn's mailbox and breaks the 
TCP connection. 
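
The lookup steps performed by the message transfer agent can be sketched as below. The two dictionaries are hypothetical stand-ins for DNS (a real agent would query a name server), and the server name mail.xyz.com and the address 192.0.2.25 are made up for the illustration.

```python
# Hypothetical stand-ins for DNS; real code would query a DNS server.
MX_RECORDS = {"xyz.com": "mail.xyz.com"}
A_RECORDS = {"mail.xyz.com": "192.0.2.25"}   # address from a documentation range


def find_smtp_endpoint(address: str) -> tuple[str, int]:
    """Return the (IP address, port) to connect to in order to deliver mail."""
    domain = address.rsplit("@", 1)[1]
    mail_server = MX_RECORDS[domain]   # step 1: MX record gives the server name
    ip = A_RECORDS[mail_server]        # step 2: second lookup gives its IP address
    return ip, 25                      # step 3: the SMTP server listens on port 25


print(find_smtp_endpoint("carolyn@xyz.com"))
```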


In due course of time, Carolyn boots up her PC, connects to her ISP, and starts her e-mail 
program. The e-mail program establishes a TCP connection to the POP3 server at port 110 of 
the ISP's mail server machine. The DNS name or IP address of this machine is typically 
configured when the e-mail program is installed or the subscription to the ISP is made. After
the TCP connection has been established, Carolyn's e-mail program runs the POP3 protocol to
fetch the contents of the mailbox to her hard disk using commands similar to those of
Fig. 7-16. Once all the e-mail has been transferred, the TCP connection is released. In fact, the
connection to the ISP can also be broken now, since all the e-mail is on Carolyn's hard disk. Of 
course, to send a reply, the connection to the ISP will be needed again, so it is not generally 
broken right after fetching the e-mail. 


IMAP 


For a user with one e-mail account at one ISP that is always accessed from one PC, POP3 
works fine and is widely used due to its simplicity and robustness. However, it is a
computer-industry truism that as soon as something works well, somebody will start demanding more
features (and getting more bugs). That happened with e-mail, too. For example, many people 
have a single e-mail account at work or school and want to access it from work, from their 
home PC, from their laptop when on business trips, and from cybercafes when on so-called 
vacation. While POP3 allows this, since it normally downloads all stored messages at each 
contact, the result is that the user's e-mail quickly gets spread over multiple machines, more 
or less at random, some of them not even the user's. 


This disadvantage gave rise to an alternative final delivery protocol, IMAP (Internet 
Message Access Protocol), which is defined in RFC 2060. Unlike POP3, which basically 
assumes that the user will clear out the mailbox on every contact and work off-line after that, 
IMAP assumes that all the e-mail will remain on the server indefinitely in multiple mailboxes. 
IMAP provides extensive mechanisms for reading messages or even parts of messages, a 
feature useful when using a slow modem to read the text part of a multipart message with 
large audio and video attachments. Since the working assumption is that messages will not be 
transferred to the user's computer for permanent storage, IMAP provides mechanisms for 
creating, destroying, and manipulating multiple mailboxes on the server. In this way a user 
can maintain a mailbox for each correspondent and move messages there from the inbox after 
they have been read. 


IMAP has many features, such as the ability to address mail not by arrival number as is done 
in Fig. 7-8, but by using attributes (e.g., give me the first message from Bobbie). Unlike POP3, 
IMAP can also accept outgoing e-mail for shipment to the destination as well as deliver 
incoming e-mail. 


The general style of the IMAP protocol is similar to that of POP3 as shown in Fig. 7-16, except
that there are dozens of commands. The IMAP server listens to port 143. A comparison of
POP3 and IMAP is given in Fig. 7-17. It should be noted, however, that not every ISP supports 
both protocols and not every e-mail program supports both protocols. Thus, when choosing an 
e-mail program, it is important to find out which protocol(s) it supports and make sure the ISP 
supports at least one of them. 


Figure 7-17. A comparison of POP3 and IMAP. 


Feature                          POP3        IMAP
Where is protocol defined        RFC 1939    RFC 2060
TCP port used                    110         143
Where is e-mail stored           User's PC   Server
Where is e-mail read             Off-line    On-line
Connect time required            Little      Much
Use of server resources          Minimal     Extensive
Multiple mailboxes               No          Yes
Who backs up mailboxes           User        ISP
Good for mobile users            No          Yes
User control over downloading    Little      Great
Partial message downloads        No          Yes
Are disk quotas a problem        No          Could be in time
Simple to implement              Yes         No
Widespread support               Yes         Growing


Delivery Features 


Independently of whether POP3 or IMAP is used, many systems provide hooks for additional 
processing of incoming e-mail. An especially valuable feature for many e-mail users is the 
ability to set up filters. These are rules that are checked when e-mail comes in or when the 
user agent is started. Each rule specifies a condition and an action. For example, a rule could 
say that any message received from the boss goes to mailbox number 1, any message from a 
select group of friends goes to mailbox number 2, and any message containing certain 
objectionable words in the Subject line is discarded without comment. 
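
A filter of this kind is just an ordered list of (condition, action) pairs. The sketch below shows one way to express the three example rules in Python. The addresses, the mailbox names, and the representation of a message as a dictionary are all assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    condition: Callable[[dict], bool]   # inspects one message
    action: Optional[str]               # target mailbox, or None to discard


def apply_filters(message: dict, rules: list[Rule], default: str = "inbox") -> Optional[str]:
    """Return the mailbox for the message, or None if it is discarded."""
    for rule in rules:
        if rule.condition(message):     # the first matching rule wins
            return rule.action
    return default


rules = [
    Rule(lambda m: m["from"] == "boss@example.com", "mailbox 1"),
    Rule(lambda m: m["from"] in {"ann@example.com", "bob@example.com"}, "mailbox 2"),
    Rule(lambda m: "objectionable" in m["subject"].lower(), None),
]
print(apply_filters({"from": "boss@example.com", "subject": "Budget"}, rules))
```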


Some ISPs provide a filter that automatically categorizes incoming e-mail as either important 
or spam (junk e-mail) and stores each message in the corresponding mailbox. Such filters 
typically work by first checking to see if the source is a known spammer. Then they usually 
examine the subject line. If hundreds of users have just received a message with the same 
subject line, it is probably spam. Other techniques are also used for spam detection. 


Another delivery feature often provided is the ability to (temporarily) forward incoming e-mail 
to a different address. This address can even be a computer operated by a commercial paging 
service, which then pages the user by radio or satellite, displaying the Subject: line on his 
pager. 


Still another common feature of final delivery is the ability to install a vacation daemon. This 
is a program that examines each incoming message and sends the sender an insipid reply such 
as 


Hi. I'm on vacation. I'll be back on the 24th of August. Have a nice summer. 


Such replies can also specify how to handle urgent matters in the interim, other people to 
contact for specific problems, etc. Most vacation daemons keep track of whom they have sent 
canned replies to and refrain from sending the same person a second reply. The good ones 
also check to see if the incoming message was sent to a mailing list, and if so, do not send a 
canned reply at all. (People who send messages to large mailing lists during the summer 
probably do not want to get hundreds of replies detailing everyone's vacation plans.) 
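
The bookkeeping such a daemon needs is small. A minimal sketch, assuming the caller can already tell whether a message arrived via a mailing list (how that is detected is outside the scope of this example):

```python
class VacationDaemon:
    """Sends each correspondent at most one canned reply."""

    def __init__(self, reply_text: str):
        self.reply_text = reply_text
        self.already_replied = set()

    def handle(self, sender: str, via_mailing_list: bool = False):
        """Return the canned reply to send, or None if none should be sent."""
        if via_mailing_list or sender in self.already_replied:
            return None
        self.already_replied.add(sender)
        return self.reply_text


daemon = VacationDaemon("Hi. I'm on vacation. I'll be back on the 24th of August.")
print(daemon.handle("elinor@abc.com"))   # the canned reply
print(daemon.handle("elinor@abc.com"))   # None: no second reply to the same person
```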


The author once ran into an extreme form of delivery processing when he sent an e-mail 
message to a person who claims to get 600 messages a day. His identity will not be disclosed 
here, lest half the readers of this book also send him e-mail. Let us call him John. 


John has installed an e-mail robot that checks every incoming message to see if it is from a 
new correspondent. If so, it sends back a canned reply explaining that John can no longer 
personally read all his e-mail. Instead, he has produced a personal FAQ (Frequently Asked 
Questions) document that answers many questions he is commonly asked. Normally, 
newsgroups have FAQs, not people. 


John's FAQ gives his address, fax, and telephone numbers and tells how to contact his 
company. It explains how to get him as a speaker and describes where to get his papers and 
other documents. It also provides pointers to software he has written, a conference he is 
running, a standard he is the editor of, and so on. Perhaps this approach is necessary, but 
maybe a personal FAQ is the ultimate status symbol. 


Webmail 


One final topic worth mentioning is Webmail. Some Web sites, for example, Hotmail and 
Yahoo, provide e-mail service to anyone who wants it. They work as follows. They have normal 
message transfer agents listening to port 25 for incoming SMTP connections. To contact, say, 
Hotmail, you have to acquire their DNS MX record, for example, by typing 


host -a -v hotmail.com 


on a UNIX system. Suppose that the mail server is called mx10.hotmail.com. Then by typing


telnet mx10.hotmail.com 25 


you can establish a TCP connection over which SMTP commands can be sent in the usual way. 
So far, nothing unusual, except that these big servers are often busy, so it may take several 
attempts to get a TCP connection accepted. 


The interesting part is how e-mail is delivered. Basically, when the user goes to the e-mail Web 
page, a form is presented in which the user is asked for a login name and password. When the user 
clicks on Sign In, the login name and password are sent to the server, which then validates them. If 
the login is successful, the server finds the user's mailbox and builds a listing similar to that of Fig. 
7-8, only formatted as a Web page in HTML. The Web page is then sent to the browser for display. 
Many of the items on the page are clickable, so messages can be read, deleted, and so on. 


7.3 Summary 


Naming in the Internet uses a hierarchical scheme called the domain name system (DNS). At 
the top level are the well-known generic domains, including com and edu as well as about 200 
country domains. DNS is implemented as a distributed database system with servers all over 
the world. DNS holds records with IP addresses, mail exchanges, and other information. By 
querying a DNS server, a process can map an Internet domain name onto the IP address used 
to communicate with that domain. 


E-mail is one of the two killer apps for the Internet. Everyone from small children to
grandparents now uses it. Most e-mail systems in the world use the mail system now defined in
RFCs 2821 and 2822. Messages sent in this system use ASCII headers to define
message properties. Many kinds of content can be sent using MIME. Messages are sent using
SMTP, which works by making a TCP connection from the source host to the destination host 
and directly delivering the e-mail over the TCP connection. 


The other killer app for the Internet is the World Wide Web. The Web is a system for linking 
hypertext documents. Originally, each document was a page written in HTML with hyperlinks
to other documents. Nowadays, XML is gradually starting to take over from HTML. Also, a large
amount of content is dynamically generated, using server-side scripts (PHP, JSP, and ASP), as
well as client-side scripts (notably JavaScript). A browser can display a document by
establishing a TCP connection to its server, asking for the document, and then closing the 
connection. These request messages contain a variety of headers for providing additional 
information. Caching, replication, and content delivery networks are widely used to enhance 
Web performance. 


The wireless Web is just getting started. The first systems are WAP and i-mode, each with 
small screens and limited bandwidth, but the next generation will be more powerful. 


Multimedia is also a rising star in the networking firmament. It allows audio and video to be 
digitized and transported electronically for display. Audio requires less bandwidth, so it is 
further along. Streaming audio, Internet radio, and voice over IP are a reality now, with new 
applications coming along all the time. Video on demand is an up-and-coming area in which 
there is great interest. Finally, the MBone is an experimental, worldwide digital live television 
service sent over the Internet. 


Chapter 8. Network Security 
8.1 Cryptography 


Cryptography comes from the Greek words for "secret writing." 

Professionals make a distinction between ciphers and codes. A cipher is a
character-for-character or bit-for-bit transformation, without regard to the linguistic structure
of the message. In contrast, a code replaces one word with another word or symbol. Codes are not
used any more, although they have a glorious history. The most successful code ever devised
was used by the U.S. armed forces during World War II in the Pacific. They simply had Navajo
Indians talking to each other using specific Navajo words for military terms, for example
chay-dagahi-nail-tsaidi (literally: tortoise killer) for antitank weapon. The Navajo language is highly
tonal, exceedingly complex, and has no written form. And not a single person in Japan knew 
anything about it. 


In September 1945, the San Diego Union described the code by saying "For three years, 
wherever the Marines landed, the Japanese got an earful of strange gurgling noises 
interspersed with other sounds resembling the call of a Tibetan monk and the sound of a hot 
water bottle being emptied." The Japanese never broke the code and many Navajo code 
talkers were awarded high military honors for extraordinary service and bravery. The fact that 
the U.S. broke the Japanese code but the Japanese never broke the Navajo code played a 
crucial role in the American victories in the Pacific. 


8.1.1 Introduction to Cryptography 


Historically, four groups of people have used and contributed to the art of cryptography: the 
military, the diplomatic corps, diarists, and lovers. Of these, the military has had the most 
important role and has shaped the field over the centuries. Within military organizations, the 
messages to be encrypted have traditionally been given to poorly-paid, low-level code clerks 
for encryption and transmission. The sheer volume of messages prevented this work from 
being done by a few elite specialists. 


Until the advent of computers, one of the main constraints on cryptography had been the 
ability of the code clerk to perform the necessary transformations, often on a battlefield with
little equipment. An additional constraint has been the difficulty in switching over quickly from one
cryptographic method to another one, since this entails retraining a large number of people. 
However, the danger of a code clerk being captured by the enemy has made it essential to be 
able to change the cryptographic method instantly if need be. These conflicting requirements 
have given rise to the model of Fig. 8-2. 


Figure 8-2. The encryption model (for a symmetric-key cipher). 


The messages to be encrypted, known as the plaintext, are transformed by a function that is
parameterized by a key. The output of the encryption process, known as the ciphertext, is 
then transmitted, often by messenger or radio. We assume that the enemy, or intruder, hears 
and accurately copies down the complete ciphertext. However, unlike the intended recipient, 
he does not know what the decryption key is and so cannot decrypt the ciphertext easily. 
Sometimes the intruder can not only listen to the communication channel (passive intruder) 
but can also record messages and play them back later, inject his own messages, or modify 
legitimate messages before they get to the receiver (active intruder). The art of breaking
ciphers, called cryptanalysis, and the art of devising them (cryptography) are collectively known
as cryptology.


It will often be useful to have a notation for relating plaintext, ciphertext, and keys. We will
use $C = E_K(P)$ to mean that the encryption of the plaintext P using key K gives the ciphertext
C. Similarly, $P = D_K(C)$ represents the decryption of C to get the plaintext again. It then follows
that


$D_K(E_K(P)) = P$


This notation suggests that E and D are just mathematical functions, which they are. The only 
tricky part is that both are functions of two parameters, and we have written one of the 
parameters (the key) as a subscript, rather than as an argument, to distinguish it from the 
message. 


A fundamental rule of cryptography is that one must assume that the cryptanalyst knows the 
methods used for encryption and decryption. In other words, the cryptanalyst knows how the
encryption method, E, and decryption method, D, of Fig. 8-2 work in detail. The amount of effort
necessary to invent, test, and install a new algorithm every time the old method is 
compromised (or thought to be compromised) has always made it impractical to keep the 
encryption algorithm secret. Thinking it is secret when it is not does more harm than good. 


This is where the key enters. The key consists of a (relatively) short string that selects one of
many potential encryptions. In contrast to the general method, which may only be changed
every few years, the key can be changed as often as required. Thus, our basic model is a
stable and publicly-known general method parameterized by a secret and easily changed key. 
The idea that the cryptanalyst knows the algorithms and that the secrecy lies exclusively in the 
keys is called Kerckhoff's principle, named after the Flemish military cryptographer Auguste 
Kerckhoff who first stated it in 1883 (Kerckhoff, 1883). Thus, we have: 


Kerckhoff's principle: All algorithms must be public; only the keys are secret 


The nonsecrecy of the algorithm cannot be emphasized enough. Trying to keep the algorithm 
secret, known in the trade as security by obscurity, never works. Also, by publicizing the 
algorithm, the cryptographer gets free consulting from a large number of academic 
cryptologists eager to break the system so they can publish papers demonstrating how smart 
they are. If many experts have tried to break the algorithm for 5 years after its publication and 
no one has succeeded, it is probably pretty solid. 


Since the real secrecy is in the key, its length is a major design issue. Consider a simple 
combination lock. The general principle is that you enter digits in sequence. Everyone knows 
this, but the key is secret. A key length of two digits means that there are 100 possibilities. A 
key length of three digits means 1000 possibilities, and a key length of six digits means a 
million. The longer the key, the higher the work factor the cryptanalyst has to deal with. The 
work factor for breaking the system by exhaustive search of the key space is exponential in 
the key length. Secrecy comes from having a strong (but public) algorithm and a long key. To
prevent your kid brother from reading your e-mail, 64-bit keys will do. For routine commercial
use, at least 128 bits should be used. To keep major governments at bay, keys of at least 256 
bits, preferably more, are needed. 


From the cryptanalyst's point of view, the cryptanalysis problem has three principal variations. 
When he has a quantity of ciphertext and no plaintext, he is confronted with the
ciphertext-only problem. The cryptograms that appear in the puzzle section of newspapers pose this kind
of problem. When the cryptanalyst has some matched ciphertext and plaintext, the problem is 
called the known plaintext problem. Finally, when the cryptanalyst has the ability to encrypt 
pieces of plaintext of his own choosing, we have the chosen plaintext problem. Newspaper 
cryptograms could be broken trivially if the cryptanalyst were allowed to ask such questions 
as: What is the encryption of ABCDEFGHIJKL? 


Novices in the cryptography business often assume that if a cipher can withstand a
ciphertext-only attack, it is secure. This assumption is very naive. In many cases the cryptanalyst can
make a good guess at parts of the plaintext. For example, the first thing many computers say 
when you call them up is login: . Equipped with some matched plaintext-ciphertext pairs, the 
cryptanalyst's job becomes much easier. To achieve security, the cryptographer should be 
conservative and make sure that the system is unbreakable even if his opponent can encrypt 
arbitrary amounts of chosen plaintext. 


Encryption methods have historically been divided into two categories: substitution ciphers and 
transposition ciphers. We will now deal with each of these briefly as background information 
for modern cryptography. 


8.1.2 Substitution Ciphers 


In a substitution cipher each letter or group of letters is replaced by another letter or group 
of letters to disguise it. One of the oldest known ciphers is the Caesar cipher, attributed to 
Julius Caesar. In this method, a becomes D, b becomes E, c becomes F, ... , and z becomes C. 
For example, attack becomes DWWDFN. In examples, plaintext will be given in lower case
letters, and ciphertext in upper case letters. 


A slight generalization of the Caesar cipher allows the ciphertext alphabet to be shifted by k 
letters, instead of always 3. In this case k becomes a key to the general method of circularly
shifted alphabets. The Caesar cipher may have fooled Pompey, but it has not fooled anyone
since.
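
The circularly shifted alphabet is easy to express in code. The sketch below follows the convention used here of lower case plaintext and upper case ciphertext. The function names are made up for the example, and characters other than letters are passed through unchanged as a simplifying assumption.

```python
import string


def caesar_encrypt(plaintext: str, k: int = 3) -> str:
    """Shift each lower case letter k places, producing upper case ciphertext."""
    out = []
    for ch in plaintext:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + k) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def caesar_decrypt(ciphertext: str, k: int = 3) -> str:
    """Undo the shift, turning upper case ciphertext back into lower case."""
    out = []
    for ch in ciphertext:
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") - k) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


print(caesar_encrypt("attack"))    # DWWDFN
print(caesar_decrypt("DWWDFN"))    # attack
```

With only 26 possible values of k, trying every key by hand is feasible, which is why this cipher offers no real protection.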


The next improvement is to have each of the symbols in the plaintext, say, the 26 letters for 
simplicity, map onto some other letter. For example, 


plaintext: abcdefghijklmnopqrstuvwxyz 
ciphertext: QWERTYUIOPASDFGHJKLZXCVBNM 


The general system of symbol-for-symbol substitution is called a monoalphabetic 
substitution, with the key being the 26-letter string corresponding to the full alphabet. For 
the key above, the plaintext attack would be transformed into the ciphertext QZZQEA. 
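
In Python, a monoalphabetic substitution amounts to a translation table built from the 26-letter key. A brief sketch using the key shown above:

```python
import string

KEY = "QWERTYUIOPASDFGHJKLZXCVBNM"
ENCRYPT = str.maketrans(string.ascii_lowercase, KEY)
DECRYPT = str.maketrans(KEY, string.ascii_lowercase)

print("attack".translate(ENCRYPT))   # QZZQEA
print("QZZQEA".translate(DECRYPT))   # attack
```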


At first glance this might appear to be a safe system because although the cryptanalyst knows 
the general system (letter-for-letter substitution), he does not know which of the
$26! \approx 4 \times 10^{26}$ possible keys is in use. In contrast with the Caesar cipher, trying
all of them is not a promising approach. Even at 1 nsec per solution, a computer would take
$10^{10}$ years to try all the keys.
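
The arithmetic behind these figures can be checked directly. The sketch assumes one key tested per nanosecond and a year of 365.25 days.

```python
import math

keys = math.factorial(26)                  # number of 26-letter keys
seconds = keys * 1e-9                      # at 1 nsec per key
years = seconds / (365.25 * 24 * 3600)
print(f"{keys:.2e} keys, about {years:.1e} years")   # roughly 4e26 keys and 1e10 years
```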


Nevertheless, given a surprisingly small amount of ciphertext, the cipher can be broken easily. 
The basic attack takes advantage of the statistical properties of natural languages. In English,
for example, e is the most common letter, followed by t, o, a, n, i, etc. The most common
two-letter combinations, or digrams, are th, in, er, re, and an. The most common three-letter
combinations, or trigrams, are the, ing, and, and ion.


A cryptanalyst trying to break a monoalphabetic cipher would start out by counting the relative 
frequencies of all letters in the ciphertext. Then he might tentatively assign the most common 
one to e and the next most common one to t. He would then look at trigrams to find a 
common one of the form tXe, which strongly suggests that X is h. Similarly, if the pattern thYt
occurs frequently, the Y probably stands for a. With this information, he can look for a 
frequently occurring trigram of the form aZW, which is most likely and. By making guesses at 
common letters, digrams, and trigrams and knowing about likely patterns of vowels and 
consonants, the cryptanalyst builds up a tentative plaintext, letter by letter. 
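
The counting step is mechanical and is a natural candidate for automation. A minimal sketch, with hypothetical function names, that tallies single letters and overlapping n-letter sequences while ignoring spaces and other non-letters:

```python
from collections import Counter


def letter_frequencies(ciphertext: str):
    """Return (letter, count) pairs, most common first."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    return Counter(letters).most_common()


def ngram_frequencies(ciphertext: str, n: int):
    """Count overlapping n-letter sequences (digrams for n=2, trigrams for n=3)."""
    letters = "".join(ch for ch in ciphertext.upper() if ch.isalpha())
    grams = (letters[i:i + n] for i in range(len(letters) - n + 1))
    return Counter(grams).most_common()


print(letter_frequencies("QZZQEA")[:2])
print(ngram_frequencies("QZZQEA", 2)[:3])
```

The cryptanalyst would then compare the top entries with the known English rankings and start making tentative assignments.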


Another approach is to guess a probable word or phrase. For example, consider the following 
ciphertext from an accounting firm (blocked into groups of five characters): 


CTBMN BYCTC BTJDS OXBNS GSTJC BTSWX CTOTZ COVUJ 
OJSGS TJOZZ MNOJS VLNSX VSZJU JDSTS JQUUS JUBXJ
DSKSU JSNTK BGAQJ ZBGYO TLCTZ BNYBN QJSW 


A likely word in a message from an accounting firm is financial. Using our knowledge that 
financial has a repeated letter (i), with four other letters between their occurrences, we look
for repeated letters in the ciphertext at this spacing. We find 12 hits, at positions 6, 15, 27, 
31, 42, 48, 56, 66, 70, 71, 76, and 82. However, only two of these, 31 and 42, have the next 
letter (corresponding to n in the plaintext) repeated in the proper place. Of these two, only 31 
also has the a correctly positioned, so we know that financial begins at position 30. From this 
point on, deducing the key is easy by using the frequency statistics for English text. 
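
The search for a probable word can be automated by comparing repetition patterns, since a monoalphabetic substitution preserves which positions hold equal letters. The sketch below is one way to do this and is not necessarily how a cryptanalyst would proceed by hand. The function names and the short test message are invented for the example.

```python
import string


def pattern(text: str) -> tuple[int, ...]:
    """Number each letter by first occurrence, e.g. 'noon' -> (0, 1, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in text)


def probable_word_positions(ciphertext: str, word: str) -> list[int]:
    """Return the 1-based positions at which the word could begin."""
    letters = "".join(ch for ch in ciphertext if ch.isalpha())
    target = pattern(word)
    n = len(word)
    return [i + 1 for i in range(len(letters) - n + 1)
            if pattern(letters[i:i + n]) == target]


# Encrypt a made-up message with a made-up key, then hunt for 'financial'.
KEY = "QWERTYUIOPASDFGHJKLZXCVBNM"
cipher = "thefinancialreport".translate(str.maketrans(string.ascii_lowercase, KEY))
print(probable_word_positions(cipher, "financial"))   # [4]
```

Each surviving position immediately yields several letter assignments, which can then be extended using the frequency statistics.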


8.1.3 Transposition Ciphers 


Substitution ciphers preserve the order of the plaintext symbols but disguise them. 
Transposition ciphers, in contrast, reorder the letters but do not disguise them. Figure 8-3 
depicts a common transposition cipher, the columnar transposition. The cipher is keyed by a 
word or phrase not containing any repeated letters. In this example, MEGABUCK is the key. 


The purpose of the key is to number the columns, column 1 being under the key letter closest 
to the start of the alphabet, and so on. The plaintext is written horizontally, in rows, padded to 
fill the matrix if need be. The ciphertext is read out by columns, starting with the column 
whose key letter is the lowest. 
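
The columnar transposition can be sketched in a few lines of Python. The plaintext below is an assumption chosen for illustration, as is the choice of padding with the letters a, b, c, and so on. The function name is hypothetical.

```python
def columnar_encrypt(plaintext: str, key: str,
                     pad: str = "abcdefghijklmnopqrstuvwxyz") -> str:
    """Write the plaintext in rows under the key, then read out the columns
    in the alphabetical order of their key letters."""
    width = len(key)
    # Pad the final row so that every column has the same height.
    shortfall = -len(plaintext) % width
    text = plaintext + pad[:shortfall]
    rows = [text[i:i + width] for i in range(0, len(text), width)]
    # Column indices sorted by key letter (the key has no repeated letters).
    order = sorted(range(width), key=lambda c: key[c])
    return "".join(row[c] for c in order for row in rows).upper()


plaintext = "pleasetransferonemilliondollarstomyswissbankaccountsixtwotwo"
print(columnar_encrypt(plaintext, "MEGABUCK"))
```

With the key MEGABUCK, the column under A (the fourth column) is read out first, then the one under B, and so on, ending with the column under U.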


Figure 8-3. A transposition cipher. 


To break a transposition cipher, the cryptanalyst must first be aware that he is dealing with a
transposition cipher. By looking at the frequency of E, T, A, O, I, N, etc., it is easy to see if 
they fit the normal pattern for plaintext. If so, the cipher is clearly a transposition cipher, 
because in such a cipher every letter represents itself, keeping the frequency distribution 
intact. 


The next step is to make a guess at the number of columns. In many cases a probable word or 
phrase may be guessed at from the context. For example, suppose that our cryptanalyst 
suspects that the plaintext phrase milliondollars occurs somewhere in the message. Observe 
that digrams MO, IL, LL, LA, IR and OS occur in the ciphertext as a result of this phrase 
wrapping around. The ciphertext letter O follows the ciphertext letter M (i.e., they are 
vertically adjacent in column 4) because they are separated in the probable phrase by a 
distance equal to the key length. If a key of length seven had been used, the digrams MD, IO, 
LL, LL, IA, OR, and NS would have occurred instead. In fact, for each key length, a different 
set of digrams is produced in the ciphertext. By hunting for the various possibilities, the 
cryptanalyst can often easily determine the key length. 


The remaining step is to order the columns. When the number of columns, k, is small, each of 
the k(k - 1) column pairs can be examined to see if its digram frequencies match those for 
English plaintext. The pair with the best match is assumed to be correctly positioned. Now 
each remaining column is tentatively tried as the successor to this pair. The column whose 
digram and trigram frequencies give the best match is tentatively assumed to be correct. The 
predecessor column is found in the same way. The entire process is continued until a potential 
ordering is found. Chances are that the plaintext will be recognizable at this point (e.g., if 
milloin occurs, it is clear what the error is). 


Some transposition ciphers accept a fixed-length block of input and produce a fixed-length 
block of output. These ciphers can be completely described by giving a list telling the order in 
which the characters are to be output. For example, the cipher of Fig. 8-3 can be seen as a 64 
character block cipher. Its output is 4, 12, 20, 28, 36, 44, 52, 60, 5, 13,... , 62. In other 
words, the fourth input character, a, is the first to be output, followed by the twelfth, f, and so 
on. 
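
Such an output list can be derived mechanically from the key. The sketch below assumes an 8-letter key, MEGABUCK, chosen because it reproduces the sequence quoted above; the function names are hypothetical.

```python
def output_order(key, block_len):
    """1-based input positions in the order the block cipher emits them."""
    k = len(key)
    columns = sorted(range(k), key=lambda i: key[i])  # lowest key letter first
    n_rows = block_len // k
    return [row * k + c + 1 for c in columns for row in range(n_rows)]


def apply_order(block, order):
    """Permute one input block according to the output list."""
    return "".join(block[pos - 1] for pos in order)


order = output_order("MEGABUCK", 64)
```

Here order begins 4, 12, 20, 28, 36, 44, 52, 60, 5, 13 and ends with 62. Describing the cipher this way makes it clear that a transposition cipher is nothing more than a fixed permutation of the positions within a block.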


8.1.4 One-Time Pads 


Constructing an unbreakable cipher is actually quite easy; the technique has been known for 
decades. First choose a random bit string as the key. Then convert the plaintext into a bit 
string, for example by using its ASCII representation. Finally, compute the XOR (eXclusive OR) 
of these two strings, bit by bit. The resulting ciphertext cannot be broken, because in a 
sufficiently large sample of ciphertext, each letter will occur equally often, as will every digram, 
every trigram, and so on. This method, known as the one-time pad, is immune to all present 
and future attacks no matter how much computational power the intruder has. The reason 
derives from information theory: there is simply no information in the message because all 
possible plaintexts of the given length are equally likely. 


An example of how one-time pads are used is given in Fig. 8-4. First, message 1, "I love you." 
is converted to 7-bit ASCII. Then a one-time pad, pad 1, is chosen and XORed with the 
message to get the ciphertext. A cryptanalyst could try all possible one-time pads to see what 
plaintext came out for each one. For example, the one-time pad listed as pad 2 in the figure 
could be tried, resulting in plaintext 2, "Elvis lives", which may or may not be plausible (a 
subject beyond the scope of this book). In fact, for every 11-character ASCII plaintext, there is 
a one-time pad that generates it. That is what we mean by saying there is no information in 
the ciphertext: you can get any message of the correct length out of it. 


Figure 8-4. The use of a one-time pad for encryption and the possibility 
of getting any possible plaintext from the ciphertext by the use of 
some other pad. 


Message 1: 1001001 0100000 1101100 1101111 1110110 1100101 0100000 1111001 1101111 1110101 0101110 
Pad 1: 1010010 1001011 1110010 1010101 1010010 1100011 0001011 0101010 1010111 1100110 0101011 
Ciphertext: 0011011 1101011 0011110 0111010 0100100 0000110 0101011 1010011 0111000 0010011 0000101 


Pad 2: 1011110 0000111 1101000 1010011 1010111 0100110 1000111 0111010 1001110 1110110 1110110 
Plaintext 2: 1000101 1101100 1110110 1101001 1110011 0100000 1101100 1101001 1110110 1100101 1110011 
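
The figure can be checked with a short Python sketch. The helper names are hypothetical, and the fourth group of pad 1 is taken to be 1010101, the value implied by the message and the ciphertext.

```python
def parse_groups(bits):
    """Turn space-separated 7-bit groups into a list of integers."""
    return [int(group, 2) for group in bits.split()]


def xor_pad(codes, pad):
    """XOR each 7-bit character code with the corresponding pad group."""
    if len(pad) < len(codes):
        raise ValueError("pad is shorter than the message")
    return [c ^ p for c, p in zip(codes, pad)]


def to_text(codes):
    """Convert 7-bit character codes back to a string."""
    return "".join(chr(c) for c in codes)


PAD1 = parse_groups("1010010 1001011 1110010 1010101 1010010 1100011 "
                    "0001011 0101010 1010111 1100110 0101011")
PAD2 = parse_groups("1011110 0000111 1101000 1010011 1010111 0100110 "
                    "1000111 0111010 1001110 1110110 1110110")

message = [ord(ch) for ch in "I love you."]
ciphertext = xor_pad(message, PAD1)
```

Because XOR is its own inverse, xor_pad(ciphertext, PAD1) recovers "I love you.", while xor_pad(ciphertext, PAD2) yields "Elvis lives". The same function serves for both encryption and decryption.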


One-time pads are great in theory but have a number of disadvantages in practice. To start 
with, the key cannot be memorized, so both sender and receiver must carry a written copy 
with them. If either one is subject to capture, written keys are clearly undesirable. 
Additionally, the total amount of data that can be transmitted is limited by the amount of key 
available. If the spy strikes it rich and discovers a wealth of data, he may find himself unable 
to transmit it back to headquarters because the key has been used up. Another problem is the 
sensitivity of the method to lost or inserted characters. If the sender and receiver get out of 
synchronization, all data from then on will appear garbled. 


With the advent of computers, the one-time pad might potentially become practical for some 
applications. The source of the key could be a special DVD that contains several gigabytes of 
information and if transported in a DVD movie box and prefixed by a few minutes of video, 
would not even be suspicious. Of course, at gigabit network speeds, having to insert a new 
DVD every 30 sec could become tedious. And the DVDs must be personally carried from the 
sender to the receiver before any messages can be sent, which greatly reduces their practical 
utility. 


Quantum Cryptography 


Interestingly, there may be a solution to the problem of how to transmit the one-time pad over 
the network, and it comes from a very unlikely source: quantum mechanics. This area is still 
experimental, but initial tests are promising. If it can be perfected and be made efficient, 
virtually all cryptography will eventually be done using one-time pads since they are provably 
secure. Below we will briefly explain how this method, quantum cryptography, works. In 
particular, we will describe a protocol called BB84 after its authors and publication year 
(Bennett and Brassard, 1984). 


A user, Alice, wants to establish a one-time pad with a second user, Bob. Alice and Bob are 
called principals, the main characters in our story. For example, Bob is a banker with whom 
Alice would like to do business. The names "Alice" and "Bob" have been used for the principals 
in virtually every paper and book on cryptography in the past decade. Cryptographers love 
tradition. If we were to use "Andy" and "Barbara" as the principals, no one would believe 
anything in this chapter. So be it. 


If Alice and Bob could establish a one-time pad, they could use it to communicate securely. 
The question is: How can they establish it without previously exchanging DVDs? We can 
assume that Alice and Bob are at opposite ends of an optical fiber over which they can send 
and receive light pulses. However, an intrepid intruder, Trudy, can cut the fiber to splice in an 
active tap. Trudy can read all the bits in both directions. She can also send false messages in 
both directions. The situation might seem hopeless for Alice and Bob, but quantum 
cryptography can shed some new light on the subject. 


Quantum cryptography is based on the fact that light comes in little packets called photons, 
which have some peculiar properties. Furthermore, light can be polarized by being passed 
through a polarizing filter, a fact well known to both sunglasses wearers and photographers. If 
a beam of light (i.e., a stream of photons) is passed through a polarizing filter, all the photons 
emerging from it will be polarized in the direction of the filter's axis (e.g., vertical). If the beam 
is now passed through a second polarizing filter, the intensity of the light emerging from the 
second filter is proportional to the square of the cosine of the angle between the axes. If the 
two axes are perpendicular, no photons get through. The absolute orientation of the two filters 
does not matter; only the angle between their axes counts. 
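
The cosine-squared relation is easy to evaluate numerically. The function below is a small sketch with a hypothetical name; it returns the fraction of the intensity that passes the second filter for a given angle between the two axes.

```python
import math


def transmitted_fraction(angle_degrees):
    """Fraction of polarized light passing a second filter at the given angle."""
    return math.cos(math.radians(angle_degrees)) ** 2
```

At 0 degrees everything gets through, at 90 degrees nothing does, and at 45 degrees exactly half does. For a single photon, which cannot be split, the fraction is interpreted as the probability of passing.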


To generate a one-time pad, Alice needs two sets of polarizing filters. Set one consists of a 
vertical filter and a horizontal filter. This choice is called a rectilinear basis. A basis (plural: 
bases) is just a coordinate system. The second set of filters is the same, except rotated 45 
degrees, so one filter runs from the lower left to the upper right and the other filter runs from 
the upper left to the lower right. This choice is called a diagonal basis. Thus, Alice has two 
bases, which she can rapidly insert into her beam at will. In reality, Alice does not have four 
separate filters, but a crystal whose polarization can be switched electrically to any of the four 
allowed directions at great speed. Bob has the same equipment as Alice. The fact that Alice 
and Bob each have two bases available is essential to quantum cryptography. 


For each basis, Alice now assigns one direction as 0 and the other as 1. In the example 
presented below, we assume she chooses vertical to be 0 and horizontal to be 1. 
Independently, she also chooses lower left to upper right as 0 and upper left to lower right as 
1. She sends these choices to Bob as plaintext. 


Now Alice picks a one-time pad, for example based on a random number generator (a complex 
subject all by itself). She transfers it bit by bit to Bob, choosing one of her two bases at 
random for each bit. To send a bit, her photon gun emits one photon polarized appropriately 
for the basis she is using for that bit. For example, she might choose bases of diagonal, 
rectilinear, rectilinear, diagonal, rectilinear, etc. To send her one-time pad of 
1001110010100110 with these bases, she would send the photons shown in Fig. 8-5(a). Given 
the one-time pad and the sequence of bases, the polarization to use for each bit is uniquely 
determined. Bits sent one photon at a time are called qubits. 


Figure 8-5. An example of quantum cryptography. 
Bob does not know which bases to use, so he picks one at random for each arriving photon 
and just uses it, as shown in Fig. 8-5(b). If he picks the correct basis, he gets the correct bit. 
If he picks the incorrect basis, he gets a random bit because if a photon hits a filter polarized 
at 45 degrees to its own polarization, it randomly jumps to the polarization of the filter or to a 
polarization perpendicular to the filter with equal probability. This property of photons is 
fundamental to quantum mechanics. Thus, some of the bits are correct and some are random, 
but Bob does not know which are which. Bob's results are depicted in Fig. 8-5(c). 


How does Bob find out which bases he got right and which he got wrong? He simply tells Alice 
which basis he used for each bit in plaintext and she tells him which are right and which are 
wrong in plaintext, as shown in Fig. 8-5(d). From this information both of them can build a bit 
string from the correct guesses, as shown in Fig. 8-5(e). On the average, this bit string will be 
half the length of the original bit string, but since both parties know it, they can use it as a 
one-time pad. All Alice has to do is transmit a bit string slightly more than twice the desired 
length and she and Bob have a one-time pad of the desired length. Problem solved. 
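
The exchange between Alice and Bob can be simulated classically. The sketch below is a simplified model, not a description of real equipment: photons are represented as (bit, basis) pairs, a measurement in the wrong basis is modeled as a fair coin flip, the channel is assumed to be noise-free, and all names are hypothetical.

```python
import random

RECTILINEAR, DIAGONAL = "+", "x"


def measure(bit, sent_basis, recv_basis, rng):
    """Matching bases return the bit; mismatched bases return a coin flip."""
    return bit if sent_basis == recv_basis else rng.randint(0, 1)


def bb84_sift(n_photons, rng):
    """Simulate one BB84 run without an intruder; return both parties' pads."""
    bases = (RECTILINEAR, DIAGONAL)
    alice_bits = [rng.randint(0, 1) for _ in range(n_photons)]
    alice_bases = [rng.choice(bases) for _ in range(n_photons)]
    bob_bases = [rng.choice(bases) for _ in range(n_photons)]
    bob_bits = [measure(b, ab, bb, rng)
                for b, ab, bb in zip(alice_bits, alice_bases, bob_bases)]
    # Bob announces his bases; Alice says which ones match; both keep those bits.
    keep = [i for i in range(n_photons) if alice_bases[i] == bob_bases[i]]
    alice_pad = [alice_bits[i] for i in keep]
    bob_pad = [bob_bits[i] for i in keep]
    return alice_pad, bob_pad
```

Calling bb84_sift(2000, random.Random(1)) returns two identical pads of roughly 1000 bits, which illustrates why Alice must send slightly more than twice as many photons as the number of pad bits she wants.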


But wait a minute. We forgot Trudy. Suppose that she is curious about what Alice has to say 
and cuts the fiber, inserting her own detector and transmitter. Unfortunately for her, she does 
not know which basis to use for each photon either. The best she can do is pick one at random 
for each photon, just as Bob does. An example of her choices is shown in Fig. 8-5(f). When 
Bob later reports (in plaintext) which bases he used and Alice tells him (in plaintext) which 
ones are correct, Trudy now knows when she got it right and when she got it wrong. In Fig. 8-5 
she got it right for bits 0, 1, 2, 3, 4, 6, 8, 12, and 13. But she knows from Alice's reply in Fig. 
8-5(d) that only bits 1, 3, 7, 8, 10, 11, 12, and 14 are part of the one-time pad. For four of 
these bits (1, 3, 8, and 12), she guessed right and captured the correct bit. For the other four 
(7, 10, 11, and 14) she guessed wrong and does not know the bit transmitted. Thus, Bob 
knows the one-time pad starts with 01011001, from Fig. 8-5(e), but all Trudy has is 01?1??0?, 
from Fig. 8-5(g). 


Of course, Alice and Bob are aware that Trudy may have captured part of their one-time pad, 
so they would like to reduce the information Trudy has. They can do this by performing a 
transformation on it. For example, they could divide the one-time pad into blocks of 1024 bits 
and square each one to form a 2048-bit number and use the concatenation of these 2048-bit 
numbers as the one-time pad. With her partial knowledge of the bit string transmitted, Trudy 
has no way to generate its square and so has nothing. The transformation from the original 
one-time pad to a different one that reduces Trudy's knowledge is called privacy 
amplification. In practice, complex transformations in which every output bit depends on 
every input bit are used instead of squaring. 
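
The squaring transformation used as the example above can be written out directly. This sketch treats the pad as a string of the characters 0 and 1; the function name and the adjustable block size are assumptions, and, as just noted, practical systems use other transformations.

```python
def amplify_by_squaring(pad_bits, block_bits=1024):
    """Square each block of the pad and concatenate the double-width results."""
    if len(pad_bits) % block_bits:
        raise ValueError("pad length must be a multiple of the block size")
    out = []
    for i in range(0, len(pad_bits), block_bits):
        value = int(pad_bits[i:i + block_bits], 2)
        # The square of a b-bit number always fits in 2b bits.
        out.append(format(value * value, "0{}b".format(2 * block_bits)))
    return "".join(out)
```

With a toy block size of 2, amplify_by_squaring("0011", block_bits=2) gives "00001001": the block 00 squares to 0 and the block 11 (decimal 3) squares to 9, which is 1001 in binary.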


Poor Trudy. Not only does she have no idea what the one-time pad is, but her presence is not 
a secret either. After all, she must relay each received bit to Bob to trick him into thinking he 
is talking to Alice. The trouble is, the best she can do is transmit the qubit she received, using 
the polarization she used to receive it, and about half the time she will be wrong, causing 
many errors in Bob's one-time pad. 
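
The effect of Trudy's relaying can be estimated with a small simulation. It is a simplified classical model with hypothetical names: a measurement in the wrong basis is a fair coin flip, the fiber itself is noise-free, and Trudy resends each photon in the basis she used to measure it.

```python
import random

BASES = ("+", "x")  # rectilinear and diagonal


def measure(bit, sent_basis, recv_basis, rng):
    """Matching bases return the bit; mismatched bases return a coin flip."""
    return bit if sent_basis == recv_basis else rng.randint(0, 1)


def sifted_error_rate(n_photons, trudy_present, rng):
    """Fraction of Bob's kept bits that differ from Alice's."""
    errors = kept = 0
    for _ in range(n_photons):
        bit, alice_basis = rng.randint(0, 1), rng.choice(BASES)
        sent_bit, sent_basis = bit, alice_basis
        if trudy_present:
            # Trudy measures in a random basis and resends what she saw.
            trudy_basis = rng.choice(BASES)
            sent_bit = measure(bit, alice_basis, trudy_basis, rng)
            sent_basis = trudy_basis
        bob_basis = rng.choice(BASES)
        bob_bit = measure(sent_bit, sent_basis, bob_basis, rng)
        if bob_basis == alice_basis:  # only these bits enter the pad
            kept += 1
            errors += bob_bit != bit
    return errors / kept if kept else 0.0
```

In this model the error rate is zero without Trudy. With her on the line, she picks the wrong basis for about half of the photons, and each of those gives Bob the wrong bit half of the time, so roughly a quarter of the bits in his pad are wrong. That is far above what noisy equipment alone would be expected to produce.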


When Alice finally starts sending data, she encodes it using a heavy forward-error-correcting 
code. From Bob's point of view, a 1-bit error in the one-time pad is the same as a 1-bit 
transmission error. Either way, he gets the wrong bit. If there is enough forward error 
correction, he can recover the original message despite all the errors, but he can easily count 
how many errors were corrected. If this number is far more than the expected error rate of the 
equipment, he knows that Trudy has tapped the line and can act accordingly (e.g., tell Alice to 
switch to a radio channel, call the police, etc.). If Trudy had a way to clone a photon so she 
had one photon to inspect and an identical photon to send to Bob, she could avoid detection, 
but at present no way to clone a photon perfectly is known. But even if Trudy could clone 
photons, the value of quantum cryptography to establish one-time pads would not be reduced. 


Although quantum cryptography has been shown to operate over distances of 60 km of fiber, 
the equipment is complex and expensive. Still, the idea has promise. For more information 
about quantum cryptography, see (Mullins, 2002). 


8.1.5 Two Fundamental Cryptographic Principles 


Although we will study many different cryptographic systems in the pages ahead, two 
principles underlying all of them are important to understand. 


Redundancy 


The first principle is that all encrypted messages must contain some redundancy, that is, 
information not needed to understand the message. An example may make it clear why this is 
needed. Consider a mail-order company, The Couch Potato (TCP), with 60,000 products. 
Thinking they are being very efficient, TCP's programmers decide that ordering messages 
should consist of a 16-byte customer name followed by a 3-byte data field (1 byte for the 
quantity and 2 bytes for the product number). The last 3 bytes are to be encrypted using a 
very long key known only by the customer and TCP. 


At first this might seem secure, and in a sense it is because passive intruders cannot decrypt 
the messages. Unfortunately, it also has a fatal flaw that renders it useless. Suppose that a 
recently-fired employee wants to punish TCP for firing her. Just before leaving, she takes the 
customer list with her. She works through the night writing a program to generate fictitious 
orders using real customer names. Since she does not have the list of keys, she just puts 
random numbers in the last 3 bytes, and sends hundreds of orders off to TCP. 


When these messages arrive, TCP's computer uses the customer's name to locate the key and 
decrypt the message. Unfortunately for TCP, almost every 3-byte message is valid, so the 
computer begins printing out shipping instructions. While it might seem odd for a customer to 
order 837 sets of children's swings or 540 sandboxes, for all the computer knows, the 
customer might be planning to open a chain of franchised playgrounds. In this way an active 
intruder (the ex-employee) can cause a massive amount of trouble, even though she cannot 
understand the messages her computer is generating. 


This problem can be solved by the addition of redundancy to all messages. For example, if 
order messages are extended to 12 bytes, the first 9 of which must be zeros, then this attack 
no longer works because the ex-employee can no longer generate a large stream of valid 
messages. The moral of the story is that all messages must contain considerable redundancy 
so that active intruders cannot send random junk and have it be interpreted as a valid 
message. 
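
The effect of the 12-byte format can be illustrated with a sketch of the receiver's validity check. The names are hypothetical, and the forged orders are modeled by assuming that decrypting random ciphertext yields what are effectively random bytes.

```python
import os


def is_valid_order(plaintext):
    """Accept only 12-byte messages whose first 9 bytes are zero."""
    return len(plaintext) == 12 and plaintext[:9] == bytes(9)


def count_accepted_forgeries(trials):
    """Feed random 12-byte 'decrypted' messages to the validity check."""
    return sum(is_valid_order(os.urandom(12)) for _ in range(trials))
```

A random 12-byte string passes the check with probability 2^-72, so the ex-employee's hundreds of forged orders are all rejected, whereas in the 3-byte scheme almost every one of them was accepted.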


However, adding redundancy also makes it easier for cryptanalysts to break messages. 
Suppose that the mail order business is highly competitive, and The Couch Potato's main 
competitor, The Sofa Tuber, would dearly love to know how many sandboxes TCP is selling. 
Consequently, they have tapped TCP's telephone line. In the original scheme with 3-byte 
messages, cryptanalysis was nearly impossible, because after guessing a key, the cryptanalyst 
had no way of telling whether the guess was right. After all, almost every message is 
technically legal. With the new 12-byte scheme, it is easy for the cryptanalyst to tell a valid 
message from an invalid one. Thus, we have 


Cryptographic principle 1: Messages must contain some redundancy 


In other words, upon decrypting a message, the recipient must be able to tell whether it is 
valid by simply inspecting it and perhaps performing a simple computation. This redundancy is 
needed to prevent active intruders from sending garbage and tricking the receiver into 
decrypting the garbage and acting on the "plaintext." However, this same redundancy makes 
it much easier for passive intruders to break the system, so there is some tension here. 
Furthermore, the redundancy should never be in the form of n zeros at the start or end of a 
message, since running such messages through some cryptographic algorithms gives more 
predictable results, making the cryptanalysts' job easier. A CRC polynomial is much better than 
a run of 0s since the receiver can easily verify it, but it generates more work for the 
cryptanalyst. Even better is to use a cryptographic hash, a concept we will explore later. 
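
As a sketch of CRC-based redundancy, the sender could append a 32-bit CRC to the payload before encrypting, and the receiver could verify it after decrypting. The example uses CRC-32 from Python's zlib module; the choice of CRC-32, the 4-byte big-endian layout, and the function names are assumptions, and the encryption step itself is omitted.

```python
import zlib


def add_crc(payload):
    """Append a 4-byte CRC-32 of the payload (applied before encryption)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(message):
    """Return the payload if the trailing CRC matches, otherwise None."""
    if len(message) < 4:
        return None
    payload, tag = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == tag:
        return payload
    return None
```

A decrypted message built from random junk passes this check only about once in 2^32 attempts. A CRC by itself offers no protection against someone who can see and modify the plaintext, which is one reason a cryptographic hash is the better choice.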


Getting back to quantum cryptography for a moment, we can also see how redundancy plays a 
role there. Due to Trudy's interception of the photons, some bits in Bob's one-time pad will be 
wrong. Bob needs some redundancy in the incoming messages to determine that errors are 
present. One very crude form of redundancy is repeating the message two times. If the two 
copies are not identical, Bob knows that either the fiber is very noisy or someone is tampering 
with the transmission. Of course, sending everything twice is overkill; a Hamming or 
Reed-Solomon code is a more efficient way to do error detection and correction. But it should be 
clear that some redundancy is needed to distinguish a valid message from an invalid message, 
especially in the face of an active intruder. 


Freshness 


The second cryptographic principle is that some measures must be taken to ensure that each 
message received can be verified as being fresh, that is, sent very recently. This measure is 
needed to prevent active intruders from playing back old messages. If no such measures were 
taken, our ex-employee could tap TCP's phone line and just keep repeating previously sent 
valid messages. Restating this idea we get: 


Cryptographic principle 2: Some method is needed to foil replay attacks 


One such measure is including in every message a timestamp valid only for, say, 10 seconds. 
The receiver can then just keep messages around for 10 seconds, to compare newly arrived 
messages to previous ones to filter out duplicates. Messages older than 10 seconds can be 
thrown out, since any replays sent more than 10 seconds later will be rejected as too old. 
Measures other than timestamps will be discussed later. 
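
The timestamp scheme can be sketched as a small filter on the receiving side. The class and parameter names are hypothetical, the sender's and receiver's clocks are assumed to be roughly synchronized, and the message identifier stands for whatever the receiver compares to recognize duplicates (for example, the message itself).

```python
class ReplayFilter:
    """Reject messages that are too old or already seen within the window."""

    def __init__(self, window=10.0):
        self.window = window  # seconds a timestamp stays valid
        self.seen = {}        # message id -> timestamp of accepted message

    def accept(self, message_id, timestamp, now):
        # Forget messages whose timestamps have expired; a replay of them
        # would be rejected as too old anyway.
        self.seen = {m: t for m, t in self.seen.items()
                     if now - t <= self.window}
        if now - timestamp > self.window:
            return False  # too old
        if message_id in self.seen:
            return False  # duplicate within the window
        self.seen[message_id] = timestamp
        return True
```

A message stamped at time 100 is accepted at time 101, rejected as a duplicate at time 105, and rejected as too old at time 120. The receiver never needs to remember more than 10 seconds' worth of traffic.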


